Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering

arXiv cs.CL Papers

Summary

Granuscore is a reference-free measure of granularity for text analysis and question answering. It uses hierarchical embedding spaces to capture fine-grained vs. coarse language and demonstrates consistent differences in model behavior across QA benchmarks.

arXiv:2605.26620v1 Announce Type: new Abstract: Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measure of granularity that leverages structural properties of a hierarchical embedding space. Granuscore reliably recovers hierarchical orderings on the Granola-EQ dataset and captures expected differences in granularity across discourse contexts. Across domains, we further show that Granuscore explains non-linear variation in sentence specificity beyond sentence length. Finally, we apply Granuscore to four question-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text.

Cached at: 05/27/26, 09:07 AM

# Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering
Source: [https://arxiv.org/html/2605.26620](https://arxiv.org/html/2605.26620)
Lukas Ellinger, Alexander Fichtl, Miriam Anschütz, and Georg Groh
School for Computation, Information and Technology, Technical University of Munich, Germany
\{lukas\.ellinger, miriam\.anschuetz, alexander\.fichtl\}@tum\.de, grohg@cit\.tum\.de

###### Abstract

Natural language conveys information at varying levels of granularity, from fine\-grained references to broad descriptions\. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity\. We introduce Granuscore, a reference\-free measure of granularity that leverages structural properties of a hierarchical embedding space\. Granuscore reliably recovers hierarchical orderings on the Granola\-EQ dataset and captures expected differences in granularity across discourse contexts\. Across domains, we further show that Granuscore explains non\-linear variation in sentence specificity beyond sentence length\. Finally, we apply Granuscore to four question\-answering benchmarks and analyze how granularity differs for questions, gold answers, and model outputs across response outcomes\. The analysis reveals consistent differences in model behavior and provides a principled lens for characterizing the difficulty of QA datasets\. Together, the results position Granuscore as a scalable, broadly applicable tool for analyzing granularity in text\.


## 1 Introduction

Natural language varies not only in *what* information is conveyed, but also in *how coarsely or finely* that information is expressed\. Consider the sentences in [Figure 1](https://arxiv.org/html/2605.26620#S1.F1)\. A speaker may refer to a person as *Tony Hawk*, *a skateboarder*, or *a sportsman*, and may locate an event in *San Diego*, *California*, or *the United States*\. These alternatives preserve the underlying fact while referring to it at different levels\. We refer to this dimension as *granularity*: the level of abstraction at which entities or events are represented in language \(Mulkar\-Mehta et al\., [2011](https://arxiv.org/html/2605.26620#bib.bib50); Rosch et al\., [1976](https://arxiv.org/html/2605.26620#bib.bib49); Hobbs, [1985](https://arxiv.org/html/2605.26620#bib.bib56)\)\.

Granularity is not incidental: speakers adapt the level of abstraction of their descriptions depending on conversational context and task requirements \(Mulkar\-Mehta et al\., [2011](https://arxiv.org/html/2605.26620#bib.bib50); Hobbs, [1985](https://arxiv.org/html/2605.26620#bib.bib56)\)\. When uncertain, speakers often prefer coarser fact descriptions that remain informative without overcommitting\. Conversely, when common ground is established, more fine\-grained references become appropriate \(Yona et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib4)\)\. Granularity should therefore be understood as a deliberate strategy that balances reliability and audience expectations\.

![Refer to caption](https://arxiv.org/html/2605.26620v1/x1.png)

Figure 1: Sentences with referential units varying in granularity\. Units that differ across sentences are underlined\. Replacing fine\-grained terms with coarser alternatives raises the sentence's Granuscore, since lower Granuscores indicate finer expressions\.

Prior work suggests that linguistic granularity affects how information is perceived and used\. In dialogue systems, responses that are too fine\-grained or too coarse can reduce user satisfaction \(Adiwardana et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib7); Thoppilan et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib30)\)\. Similarly, in simplified language settings, controlling granularity is important for accessibility and comprehension, as it reduces cognitive load \(OECD, [2024](https://arxiv.org/html/2605.26620#bib.bib48); Anschütz et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib32)\)\. However, studying such effects systematically is difficult because existing approaches do not provide a scalable, reference\-free measure of granularity at the sentence level\.

Our contributions are as follows:

- We introduce Granuscore, a reference\-free measure of granularity that exploits structural properties of a hierarchical embedding space\.
- We validate Granuscore both empirically and conceptually\. It reliably recovers human\-annotated orderings on Granola\-EQ \(Yona et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib4)\) and captures expected granularity differences across discourse contexts\.
- We show that across domains Granuscore explains non\-linear variation in sentence specificity beyond sentence length\.
- We demonstrate the practical relevance of Granuscore for question answering\. Evaluating six language models on four QA benchmarks, we identify consistent differences in granularity between questions, gold answers, and model outputs across response outcomes\. These patterns provide a principled lens for characterizing QA dataset difficulty and analyzing model behavior\.
- We release Granuscore as a [pip package](https://github.com/lukasellinger/granuscore) to ensure reproducibility and enable its use in further research or production\.

## 2 Background and Related Work

#### Granularity

Mulkar\-Mehta et al\. \([2011](https://arxiv.org/html/2605.26620#bib.bib50)\) describe granularity in natural language as shifts between coarse and fine descriptions, where higher\-level representations abstract from more detailed components\. Related perspectives appear in cognitive science, where concepts are organized at different levels of abstraction within taxonomies \(Rosch et al\., [1976](https://arxiv.org/html/2605.26620#bib.bib49)\)\. Further, foundational work by Hobbs \([1985](https://arxiv.org/html/2605.26620#bib.bib56)\) argues that intelligent reasoning requires representing the world at multiple levels of granularity and switching between them as needed, allowing complex phenomena to be modeled through simpler abstractions\.

A related property is *term specificity*, which refers to identifying index terms that distinguish one class of documents from others\. In particular, Kim \([2006](https://arxiv.org/html/2605.26620#bib.bib10)\) describes *hierarchical specificity* as a term’s position within a generic–specific hierarchy, where narrower terms correspond to more specific concepts, matching the notion of granularity\.

We capture these ideas using structural properties of a hierarchical embedding space\. Unlike approaches relying on manually constructed hierarchies, this enables estimating granularity without being restricted to predefined vocabularies\.

#### Sentence Specificity

*Sentence specificity* refers to the extent to which a sentence conveys concrete information and supports consistent interpretation across readers \(Li et al\., [2016](https://arxiv.org/html/2605.26620#bib.bib11); Ko et al\., [2019](https://arxiv.org/html/2605.26620#bib.bib16)\)\. Prior work has shown its relevance for reading comprehension \(Dixon, [1987](https://arxiv.org/html/2605.26620#bib.bib35)\) and establishing common ground in dialogue \(Djalali et al\., [2011](https://arxiv.org/html/2605.26620#bib.bib34)\)\.

Although finer\-grained references often increase sentence specificity, granularity and sentence specificity capture different properties\. Sentence specificity reflects the amount of descriptive information conveyed by a sentence, whereas granularity describes the level at which referential expressions occur within a semantic hierarchy\. A sentence can therefore become more specific by adding descriptive details without changing the granularity of its referents\. For example, “The skateboarder won the competition” becomes more specific in “The skateboarder won the competition and set a new record\.” The referents remain at the same granularity level, but the sentence conveys more information\.

#### Granularity Evaluation

While granularity has been implicitly discussed in work on specificity, informativeness, and semantic hierarchies \(Thoppilan et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib30); Adiwardana et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib7); Ko et al\., [2019](https://arxiv.org/html/2605.26620#bib.bib16); Li et al\., [2016](https://arxiv.org/html/2605.26620#bib.bib11)\), existing automatic evaluations typically rely on taxonomy depth \(e\.g\., WordNet hypernym levels \(Miller, [1994](https://arxiv.org/html/2605.26620#bib.bib1)\) or hierarchical relations in knowledge graphs such as Wikidata \(Vrandečić and Krötzsch, [2014](https://arxiv.org/html/2605.26620#bib.bib53); Huang et al\., [2023](https://arxiv.org/html/2605.26620#bib.bib15)\)\)\. However, these approaches require entities to exist in the underlying taxonomy and therefore provide limited coverage for free\-form text\. In contrast, embedding\-based approaches can operate directly on arbitrary text\.

Huang et al\. \([2023](https://arxiv.org/html/2605.26620#bib.bib15)\) propose an automatic benchmark for measuring specificity using transitive relations derived from Wikidata\. However, the induced orderings can yield unintuitive comparisons, for example, ranking *Mexico* as more granular than *Colombia*, or *historian* as more granular than *writer*\. We therefore acknowledge this dataset but refrain from using it in our experiments\.

Yona et al\. \([2024](https://arxiv.org/html/2605.26620#bib.bib4)\) introduce Granola\-EQ, a question answering dataset with explicitly controlled answer granularity levels\. They show that standard decoding methods tend to produce overly granular and often incorrect answers\. We build on this dataset to train Granuscore and extend their analysis by applying granularity estimation to a broader set of QA datasets, studying how granularity relates to model outputs, correctness, and dataset difficulty\.

#### Training Signals for Informativeness and Interestingness

The informativeness of model responses plays a central role in user engagement and response quality \(Adiwardana et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib7); Thoppilan et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib30)\)\. While early work relies on human annotation to supervise informativeness \(Adiwardana et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib7); Thoppilan et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib30)\), more recent approaches use LLM\-based judges to obtain relative preference signals by comparing response pairs \(Wu et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib9)\)\. Relatedly, Onozeki and Inaba \([2025](https://arxiv.org/html/2605.26620#bib.bib6)\) introduce interestingness as a training signal and assign scores using an LLM judge\.

In contrast to these approaches, which depend on human supervision or pairwise or model\-based judgments, Granuscore provides a reference\-free, scalable signal that measures granularity on an absolute and interpretable scale\.

## 3 Granuscore

![Refer to caption](https://arxiv.org/html/2605.26620v1/x2.png)

Figure 2: Granuscore pipeline: extraction of hierarchical depth \(Dist0\) and comparison to anchor entities, followed by gradient\-boosted trees and percentile calibration to produce a scalar granularity score\.

Granuscore measures semantic granularity by exploiting structural properties of a hierarchical embedding space, where lower scores correspond to finer\-grained expressions\. We build on the [Hierarchy Transformer model](https://huggingface.co/Hierarchy-Transformers/HiT-MiniLM-L12-WordNetNoun) \(HiT\) proposed by Chen et al\. \([2024](https://arxiv.org/html/2605.26620#bib.bib25)\), who train transformer encoders to represent hierarchical structure in a hyperbolic embedding space modeled as a Poincaré ball\. In this geometry, hierarchical relations are represented by radial distance from the origin: more specific concepts lie farther from the center, while more general concepts lie closer\. We denote this radial distance as Dist0, which captures hierarchical depth and serves as a primary signal for granularity\. We use the variant trained on the WordNet hierarchy, as WordNet \(Miller, [1994](https://arxiv.org/html/2605.26620#bib.bib1)\) provides a broad\-coverage commonsense structure\.
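To make the radial signal concrete, the following is a minimal sketch of a Dist0-style computation using the standard closed form for the hyperbolic distance from the origin of a Poincaré ball with curvature -c. The function name `dist0`, the `curvature` default, and the two example vectors are illustrative assumptions; the manifold configuration actually used by HiT is not specified in this section and may differ.

```python
import math


def dist0(embedding, curvature=1.0):
    """Hyperbolic distance from the origin of a Poincare ball (curvature -c).

    Standard closed form: d(0, x) = (2 / sqrt(c)) * artanh(sqrt(c) * ||x||).
    The curvature value is an assumption, not taken from the paper.
    """
    sqrt_c = math.sqrt(curvature)
    norm = math.sqrt(sum(x * x for x in embedding))
    # Valid points lie strictly inside the ball of radius 1 / sqrt(c).
    if sqrt_c * norm >= 1.0:
        raise ValueError("embedding lies outside the Poincare ball")
    return (2.0 / sqrt_c) * math.atanh(sqrt_c * norm)


# Hypothetical 2-d embeddings: a general concept near the origin and a
# specific concept near the boundary of the ball.
general = [0.05, 0.0]
specific = [0.6, 0.7]
print(dist0(general) < dist0(specific))
```

Because the distance grows without bound near the boundary, small changes in Euclidean norm for very specific concepts translate into large changes in this radial depth.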

While Dist0 captures global hierarchical position, additional signals can be obtained by relating to other entities in the space\. Therefore, we compare the input embedding against a set of anchor entities and derive features from the resulting pairwise relations\. In our default configuration, we use 999 randomly sampled fixed anchors, which performed best in our ablation \([Appendix G](https://arxiv.org/html/2605.26620#A7)\)\. Alternative strategies are described in [Section 3\.3](https://arxiv.org/html/2605.26620#S3.SS3)\.
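As a rough illustration of anchor comparison, the sketch below builds a feature vector from one input embedding and a fixed anchor set. It assumes cosine similarity and Euclidean distance as the pairwise relations; this section does not state the exact similarity and distance functions, so both choices, along with the helper names, are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def anchor_features(embedding, anchors):
    """Return raw per-anchor similarities followed by per-anchor distances.

    The values are kept unaggregated so that a downstream model can learn
    interactions between individual anchors.
    """
    sims = [cosine_similarity(embedding, a) for a in anchors]
    dists = [euclidean_distance(embedding, a) for a in anchors]
    return sims + dists


# Two toy anchors; the default configuration in the text uses 999.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
features = anchor_features([1.0, 0.0], toy_anchors)
print(len(features))
```

With k anchors, this layout yields 2k features per input, to which a depth feature such as Dist0 can be appended.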

[Figure 2](https://arxiv.org/html/2605.26620#S3.F2) illustrates the resulting pipeline\. Given an input word or phrase, the model first obtains a hierarchical embedding and extracts Dist0\. It then computes pairwise similarity and distance features to the anchor entities using a [Wikidata\-derived](https://huggingface.co/datasets/philippesaade/wikidata) embedding index\. To map these features to a scalar granularity score, we train gradient\-boosted decision trees using LightGBM \(Ke et al\., [2017](https://arxiv.org/html/2605.26620#bib.bib27)\)\. The model operates directly on the raw similarity and distance values, allowing it to capture fine\-grained interaction patterns that would be lost under pre\-aggregation\. Details on the training procedure and model hyperparameters are provided in [Appendix E](https://arxiv.org/html/2605.26620#A5)\. Because the resulting raw scores depend on the annotations of Granola\-EQ, we convert them to percentile scores using a fixed calibration distribution\. We choose the WordNet noun set \(approximately 119k concepts\), which was also used to train the HiT model, providing an annotator\-independent alignment\. [Appendix F\.3](https://arxiv.org/html/2605.26620#A6.SS3) shows how annotation levels map to raw and percentile scores\.
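The percentile calibration step can be sketched as an empirical-CDF lookup against a fixed list of raw scores. The percentile convention used here (share of calibration scores less than or equal to the input) and the function name are assumptions; the toy calibration list stands in for raw scores of the roughly 119k WordNet nouns.

```python
from bisect import bisect_right


def make_percentile_calibrator(calibration_scores):
    """Build a function mapping a raw score to a 0-100 percentile."""
    ordered = sorted(calibration_scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("calibration distribution must not be empty")

    def to_percentile(raw_score):
        # Share of calibration scores that are <= raw_score, scaled to 0-100.
        return 100.0 * bisect_right(ordered, raw_score) / n

    return to_percentile


# Hypothetical raw scores standing in for the calibration distribution.
calibrate = make_percentile_calibrator([1.0, 2.0, 3.0, 4.0])
print(calibrate(2.5))
```

Because the calibration distribution is fixed, the same raw score always maps to the same percentile, independent of the batch being scored.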

### 3\.1 Dataset

To train the LightGBM model, we use GRANOLA\-EQ \(Yona et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib4)\), an extension of the ENTITYQUESTIONS dataset \(Sciavolino et al\., [2021](https://arxiv.org/html/2605.26620#bib.bib19)\)\. Each dataset entry consists of a question and a set of answers referring to the same underlying *reference entity* at different levels of granularity\. We refer to the ordered list of such answers as an *answer hierarchy*, and to the individual answers as *granularity realizations*\.

During preprocessing, we remove entries with more than four granularity realizations \(fewer than 1\.2% of the data\) as these typically reflect inconsistencies introduced during generation\. The resulting dataset contains, on average, approximately three realizations per question \(2% with one, 22% with two, 62% with three, and 14% with four\)\.

Since GRANOLA\-EQ was generated by prompting an LLM to list increasingly coarse answers, the number of realizations per question varies, and no fixed hierarchical structure is enforced \(e\.g\., city → state → country\)\. The LLM implicitly determines the resolution of the answer hierarchy it considers appropriate for a given question\. To obtain comparable training targets, we normalize the answer levels to a continuous scale from 1 \(most fine\-grained\) to 4 \(most coarse\-grained\); for example, a hierarchy with three answers is mapped to levels $\{1, 2.5, 4\}$\.
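The level normalization can be sketched as evenly spacing the realizations of a hierarchy on the interval from 1 to 4, which reproduces the three-answer example above. The function name is hypothetical, and the handling of single-realization hierarchies (mapped to level 1 here) is an assumption, since the text does not describe that case.

```python
def normalize_levels(num_realizations, low=1.0, high=4.0):
    """Evenly space the levels of an answer hierarchy on [low, high]."""
    if num_realizations < 1:
        raise ValueError("a hierarchy needs at least one realization")
    if num_realizations == 1:
        # Assumption: a lone realization is treated as the finest level.
        return [low]
    step = (high - low) / (num_realizations - 1)
    return [low + i * step for i in range(num_realizations)]


for n in (2, 3, 4):
    print(n, normalize_levels(n))
```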

Due to the construction of GRANOLA\-EQ, the same entity may appear at different granularity levels across dataset entries depending on the question context \(e\.g\., England appears 487 times with a mean granularity of 3\.25 and variance 0\.44\)\. We retain these variations, allowing the model to learn from multiple granularity annotations of the same realization and encouraging generalization\.

Finally, to prevent data leakage, we enforce that no granularity realization appears in more than one split\. The final dataset consists of 6,702 training samples and 1,220 samples each for development and test\. For dataset details, see[Appendix F](https://arxiv.org/html/2605.26620#A6)\.

### 3\.2 Extension to Multi\-Word Inputs

Because Granuscore is defined for individual referential units, we extend it to sentences and longer text spans by decomposing the text\. We use a spaCy\-based splitter \(Honnibal et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib29)\)\. Noun phrases are kept intact, while stop words and non\-informative symbols are removed\. This preserves referential expressions that convey granularity while avoiding fragmentation of multi\-word concepts\. If no referential units can be identified \(e\.g\., the input consists solely of stop words\), we assign a Granuscore of 100, corresponding to the coarsest granularity score\.

We avoid decomposing inputs into atomic facts, as a single fact may contain multiple entities with different granularity levels, making fine\-grained attribution difficult\. For example, in [Figure 1](https://arxiv.org/html/2605.26620#S1.F1), both the person and the location contribute to the sentence’s granularity, such that modifying either referent changes the perceived granularity\. Moreover, atomic fact decomposition can introduce or duplicate lexical material not explicitly present in the original sentence, which may bias the resulting scores \(Wanner et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib12)\)\.

Finally, we compute the granularity score for a multi\-word input using a two\-step aggregation\. First, we compute a sentence\-level Granuscore by averaging the scores of the extracted referential units within each sentence\. We then aggregate across sentences by taking the mean of the bottom 80% of sentence Granuscores, which reduces the influence of unusually high scores\. We evaluate a range of alternative aggregation strategies and compare them in[Appendix H](https://arxiv.org/html/2605.26620#A8)\. Based on this ablation, we adopt this aggregation as the default, as it shows the strongest performance\.
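The two-step aggregation can be sketched as follows: average unit scores within each sentence, then average the lowest 80% of sentence scores. Rounding the number of kept sentences up with `ceil`, applying the fallback score of 100 to individual sentences without referential units, and all function names are assumptions made for illustration.

```python
import math

COARSEST_SCORE = 100.0  # fallback when no referential units are found


def sentence_score(unit_scores):
    """Mean Granuscore of the referential units in one sentence."""
    if not unit_scores:
        return COARSEST_SCORE
    return sum(unit_scores) / len(unit_scores)


def text_score(sentences, keep_fraction=0.8):
    """Mean of the lowest `keep_fraction` of sentence-level scores.

    `sentences` is a list of lists of per-unit scores.
    """
    if not sentences:
        return COARSEST_SCORE
    scores = sorted(sentence_score(units) for units in sentences)
    # Assumption: round the number of kept sentences up, keeping at least one.
    keep = max(1, math.ceil(keep_fraction * len(scores)))
    kept = scores[:keep]
    return sum(kept) / len(kept)


# Hypothetical per-unit scores for a five-sentence text.
example = [[10.0, 30.0], [40.0], [60.0, 80.0], [50.0], [90.0]]
print(text_score(example))
```

In this toy input the sentence scores are 20, 40, 70, 50, and 90; the highest one is dropped before averaging, which is how the trimming dampens unusually coarse sentences.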

### 3\.3 Methods

![Refer to caption](https://arxiv.org/html/2605.26620v1/x3.png)

Figure 3: Illustration of the hierarchical embedding space\. Referential units are embedded in a radial semantic hierarchy, with coarser concepts closer to the center and finer\-grained concepts in outer regions\.

To evaluate the effectiveness of Granuscore, we compare it against several baselines and variants that estimate granularity using lexical, hierarchical, or embedding\-based signals\. The embedding\-based variants are illustrated in [Figure 3](https://arxiv.org/html/2605.26620#S3.F3)\.

- Word Count: Number of words in the text\. We use the negative word count so that higher scores correspond to coarser concepts\.
- WordNet Hierarchy: Average depth of mapped WordNet synsets; deeper nodes correspond to finer concepts\.
- GPT\-4\.1 mini: Few\-shot prompting to estimate granularity \([Appendix D](https://arxiv.org/html/2605.26620#A4)\)\.
- HiT Dist0: Radial distance Dist0 only\.
- Nearest Neighbors \(NN\): Top\-$k$ cosine\-similar entities as anchors\.
- Random: $k$ dynamically sampled anchors\.
- Random Anchors: A fixed set of $k$ random anchors\.
- Radial Anchors: A fixed set of $k$ anchors sampled across HiT Dist0 distance bins\.

For Nearest Neighbors, Random, and Random Anchors, we evaluate both HiT and MiniLM embeddings\. MiniLM serves as a widely used non\-hierarchical embedding baseline to contextualize the contribution of hierarchical representations\. Radial Anchors are only defined for HiT, as they rely on the HiT Dist0 radial structure\. Additional details on the methods are provided in [Appendix C](https://arxiv.org/html/2605.26620#A3)\. Exact versions of all models used throughout the paper are listed in [Appendix B](https://arxiv.org/html/2605.26620#A2)\.

### 3\.4 Evaluation Approaches

We evaluate Granuscore across three complementary settings that test different aspects\. First, we measure how well methods recover controlled granularity orderings\. Second, we examine whether granularity scores capture differences across discourse contexts\. Finally, we analyze how Granuscore relates to sentence specificity\.

#### GRANOLA\-EQ

We first evaluate all methods on the test set of GRANOLA\-EQ\. We use Pairwise Accuracy, defined as the percentage of correctly ordered pairs of granularity realizations\. For realizations $R_i$ and $R_j$ with gold ordering $R_{i,\text{gold}} < R_{j,\text{gold}}$, a prediction is considered correct if $R_{i,\text{pred}} < R_{j,\text{pred}}$\. Pairs with identical gold granularity levels are excluded because the resolution of the GRANOLA\-EQ annotations does not define a unique ordering\.

We compute this metric in two settings\. In the global setting, pairwise accuracy is computed across all dataset entries, measuring the ability of a method to assign consistent granularity scores to unrelated entities\. This task is particularly challenging because entities may belong to different semantic dimensions and must be placed on a shared granularity scale\. For example, a model must compare realizations such as skateboarder and California, originating from different hierarchies \(e\.g\., Tony Hawk → American skateboarder → skateboarder → sportsman and San Diego → California → United States → America\)\.

In the intra\-entry setting, pairwise accuracy is computed within the hierarchy of each entry\. This evaluates how well methods recover the local ordering of semantically related realizations\.
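The two settings can be sketched with one pair-counting helper. Each realization is represented as a `(gold_level, predicted_score)` tuple, and an entry is a list of such tuples. Following the strict inequality in the definition, tied predictions count as incorrect. Pooling pairs across entries for the intra-entry score (a micro-average) and the function names are assumptions.

```python
from itertools import combinations


def count_ordered_pairs(items):
    """Return (correct, total) over pairs with distinct gold levels."""
    correct = total = 0
    for (gold_a, pred_a), (gold_b, pred_b) in combinations(items, 2):
        if gold_a == gold_b:
            continue  # identical gold levels define no ordering
        total += 1
        # Correct only if predictions are strictly ordered like the gold levels.
        if pred_a != pred_b and (gold_a < gold_b) == (pred_a < pred_b):
            correct += 1
    return correct, total


def global_pairwise_accuracy(entries):
    """Compare every realization with every other one across all entries."""
    pooled = [item for entry in entries for item in entry]
    correct, total = count_ordered_pairs(pooled)
    return 100.0 * correct / total if total else float("nan")


def intra_entry_pairwise_accuracy(entries):
    """Compare realizations only within their own answer hierarchy."""
    correct = total = 0
    for entry in entries:
        c, t = count_ordered_pairs(entry)
        correct += c
        total += t
    return 100.0 * correct / total if total else float("nan")


# Two hypothetical entries with (gold_level, predicted_score) pairs.
toy_entries = [
    [(1.0, 10.0), (2.5, 30.0), (4.0, 60.0)],
    [(1.0, 40.0), (4.0, 20.0)],
]
print(global_pairwise_accuracy(toy_entries))
print(intra_entry_pairwise_accuracy(toy_entries))
```

The global variant also compares realizations from different hierarchies, so it involves many more pairs and penalizes scores that are only locally consistent.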

#### Discourse Contexts

Large\-scale annotations of granularity for longer text are difficult to obtain, so we instead rely on naturally occurring discourse differences as an unsupervised proxy\. Scientific papers provide a suitable testbed, as their standardized section structure reflects distinct discourse functions: Introduction sections typically describe the broader research context using more coarse\-grained references, whereas Related Work sections contain more fine\-grained references to specific prior methods, datasets, and papers, reflecting common rhetorical structures in scientific writing \(Swales, [1990](https://arxiv.org/html/2605.26620#bib.bib58); Day and Gastel, [2012](https://arxiv.org/html/2605.26620#bib.bib59)\)\.

We apply Granuscore to scientific articles from the S2ORC corpus \(Lo et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib14)\)\. We sample 1,000 papers and compare the granularity of Introduction and Related Work sections\. Further details on the sampling and filtering procedure are provided in [Appendix J](https://arxiv.org/html/2605.26620#A10)\.

#### Sentence Specificity

Finally, we examine how Granuscore relates to sentence specificity\. We analyze human\-annotated datasets from Ko et al\. \([2019](https://arxiv.org/html/2605.26620#bib.bib16)\) and Li et al\. \([2016](https://arxiv.org/html/2605.26620#bib.bib11)\)\. The former covers movie reviews, tweets, and Yelp reviews, while the latter contains sentences from news articles\.

Sentence length \(word count\) is a strong baseline predictor with Spearman correlations of 0\.45 \(Twitter\), 0\.58 \(movie reviews\), 0\.68 \(Yelp\), and 0\.67 \(news\)\. We therefore quantify Granuscore’s contribution to sentence specificity beyond sentence length by fitting Generalized Additive Models \(GAMs\)\. This allows us to isolate the contribution of each predictor via explained deviance\.

## 4 Results

Below, we report results for the three evaluation settings introduced above\.

### 4\.1 GRANOLA\-EQ

[Table 1](https://arxiv.org/html/2605.26620#S4.T1) reports the performance of all methods on GRANOLA\-EQ using Global and Intra\-entry Pairwise Accuracy and Exact Ordering Accuracy\. Additional metrics are provided in [Appendix F\.4](https://arxiv.org/html/2605.26620#A6.SS4)\.

Across all methods, intra\-entry pairwise accuracy consistently exceeds global pairwise accuracy, indicating that ranking realizations within the same semantic hierarchy is easier than assigning consistent scores across unrelated entities\. Exact ordering accuracy is consistently lower, reflecting the greater difficulty of recovering the full hierarchy rather than individual pairwise relations\.

The radial depth signal already provides a strong baseline: HiT Dist0 achieves 80\.82% global pairwise accuracy and 87\.86% intra\-entry accuracy\.

| Method | Global PW Acc\. | Intra PW Acc\. | Exact |
| --- | --- | --- | --- |
| Word Count | 50\.49 | 51\.54 | 28\.80 |
| WordNet† | 58\.12 | 67\.32 | 60\.39 |
| GPT\-4\.1 mini | 76\.76 | 81\.58 | 58\.21 |
| HiT Dist0 | 80\.82 | 87\.86 | 73\.50 |
| MiniLM NN | 67\.40 | 69\.50 | 48\.97 |
| MiniLM Random | 67\.55 | 70\.57 | 48\.80 |
| MiniLM RandomAnch | 67\.45 | 71\.15 | 49\.15 |
| HiT NN | 81\.79 | 86\.47 | 70\.17 |
| HiT Random | 80\.80 | 84\.74 | 69\.32 |
| HiT RadialAnch | 82\.31 | 88\.35 | 71\.71 |
| HiT RandomAnch | 83\.00 | 88\.15 | 72\.48 |
| HiT Dist0 \+ NN | 82\.86 | 87\.02 | 71\.54 |
| HiT Dist0 \+ Random | 82\.01 | 86\.82 | 70\.85 |
| HiT Dist0 \+ RadialAnch | *83\.22* | *88\.83* | *73\.85* |
| HiT Dist0 \+ RandomAnch | **83\.76** | **89\.03** | **74\.36** |

Table 1: Comparison of methods on the GRANOLA\-EQ test set\. We report Global Pairwise Accuracy \(Global PW Acc\.\), Intra\-entry Pairwise Accuracy \(Intra PW Acc\.\), and Exact Ordering Accuracy \(Exact\)\. Bold indicates the best result and italic indicates the second\-best result across methods\. †: WordNet could only derive a granularity level for 17\.61% of the realizations\.

In comparison, anchor\-based HiT methods generally outperform HiT Dist0 in global pairwise accuracy, with HiT Random Anchors ahead by 2\.18 percentage points\. Notably, anchors sampled across the embedding space outperform nearest neighbors, suggesting that global structure is more informative for estimating granularity than local similarity\.

Combining HiT Dist0 with Random Anchors achieves the highest scores across all metrics\. Compared to HiT Dist0, it improves by \+2\.94 \(global\) and \+1\.17 \(intra\-entry\) points, and over HiT Random Anchors by \+0\.76 and \+0\.88 \(bootstrap resampling, $N = 20{,}000$; $p < 0.002$ global, $p < 0.05$ intra\-entry\)\.

In contrast, all MiniLM\-based variants perform substantially worse \(best global: 67\.55%\) and yield nearly identical scores regardless of the anchor selection strategy\. This suggests that the underlying embedding geometry provides a weaker signal for granularity than HiT\.

Among additional baselines, the WordNet hierarchy achieves 58\.12% global pairwise accuracy despite covering only 17\.61% of realizations\. This shows that lexical taxonomies contain meaningful signals for granularity when applicable, but their use is limited by coverage\. GPT\-4\.1 mini performs competitively in pairwise ordering but shows lower exact ordering accuracy\. Here, manual inspection indicates that the model frequently assigns identical granularity levels to multiple realizations, thereby reducing its ability to recover the full hierarchy\. Finally, the Word Count baseline performs close to random \(50%\), confirming that granularity is not reflected in sentence length\.

Overall, these results show that granularity is best captured by combining hierarchical depth and anchor comparisons in the embedding space\.

### 4\.2 Granuscore Across Paper Sections

Beyond the gold\-labeled setting, we evaluate whether Granuscore captures differences across discourse contexts\. Across the sampled papers, 68\.71% of paired comparisons exhibit a higher Granuscore for the Introduction than for the Related Work section, indicating that Introduction sections tend to use more coarse\-grained language\. This difference is highly significant \(paired $t$\-test: $p \leq 5.49 \times 10^{-37}$; Wilcoxon signed\-rank test: $p \leq 5.73 \times 10^{-39}$\) with a moderate paired effect size \($d_z = 0.42$; Cohen, [2013](https://arxiv.org/html/2605.26620#bib.bib55)\)\. Consistent with this ordering, Introduction sections also have a higher average Granuscore \($75.29 \pm 4.94$\) than Related Work sections \($72.43 \pm 5.54$\)\.
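The paired statistics above can be illustrated with a short sketch that computes the share of papers whose Introduction scores higher and the paired effect size, defined as the mean of the per-paper differences divided by their sample standard deviation. The function name and the four toy score pairs are hypothetical and unrelated to the S2ORC sample.

```python
from statistics import mean, stdev


def paired_summary(intro_scores, related_scores):
    """Return (% of pairs with higher Introduction score, Cohen's d_z)."""
    diffs = [i - r for i, r in zip(intro_scores, related_scores)]
    share_higher = 100.0 * sum(d > 0 for d in diffs) / len(diffs)
    # d_z: mean paired difference over the sample SD of the differences.
    d_z = mean(diffs) / stdev(diffs)
    return share_higher, d_z


# Hypothetical section-level Granuscores for four papers.
intro = [76.0, 74.0, 78.0, 71.0]
related = [72.0, 75.0, 73.0, 70.0]
share, effect = paired_summary(intro, related)
print(share, round(effect, 2))
```

Because the effect size is computed on within-paper differences, it is unaffected by papers that are uniformly coarse or uniformly fine in both sections.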

This pattern aligns with the rhetorical roles of these sections\.

### 4\.3 Correlation to Sentence Specificity

| Domain | Expl\. Dev\. \(Length\) | Expl\. Dev\. \(Len\+Gran\) | Δ |
| --- | --- | --- | --- |
| movie | 0\.38 | 0\.46 | \+0\.08 |
| twitter | 0\.24 | 0\.36 | \+0\.12 |
| yelp | 0\.52 | 0\.56 | \+0\.04 |
| news | 0\.45 | 0\.55 | \+0\.10 |

Table 2: Explained deviance of generalized additive models \(GAMs\) predicting sentence specificity\. All smooth terms are significant \($p < 2.41 \times 10^{-10}$\)\.

![Refer to caption](https://arxiv.org/html/2605.26620v1/x4.png)

Figure 4: Effect of Granuscore on sentence specificity across domains\. Lower specificity scores correspond to more specific sentences\. The plotted range is restricted to the 1st–99th percentiles of Granuscore to avoid sparse\-support regions\.

Finally, we analyze how Granuscore relates to sentence specificity\. [Table 2](https://arxiv.org/html/2605.26620#S4.T2) shows that adding Granuscore consistently improves explained deviance over a length\-only baseline across all domains\. Absolute gains range from \+0\.04 \(Yelp\) to \+0\.12 \(Twitter\), corresponding to relative improvements of 7–50%\. In [Figure 4](https://arxiv.org/html/2605.26620#S4.F4), we show the estimated effect of Granuscore on sentence specificity\. Across all domains, the relationship is non\-linear\. Lower Granuscore negatively affects the specificity score, indicating an association with more specific sentences\. As Granuscore increases, the magnitude of this negative effect decreases\. The effect crosses zero between values of roughly 63–66, after which higher scores are associated with less specific sentences\. For completeness, the effect of sentence length is shown in [Appendix I](https://arxiv.org/html/2605.26620#A9)\.

Overall, these results show that Granuscore captures a systematic component of sentence specificity while remaining distinct from it\. Although granularity alone does not determine specificity, incorporating Granuscore consistently improves specificity prediction across domains\. This pattern aligns with the intuition that references to fine\-grained entities tend to appear in more specific sentences, whereas coarse\-grained concepts are more common in less specific ones\.
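The absolute and relative gains in explained deviance can be recomputed from the rounded entries of Table 2, as in the sketch below. The dictionary layout and the function name are illustrative. Because the inputs are rounded to two decimals, the recomputed relative gains may differ slightly from the range reported in the text.

```python
# Explained deviance from Table 2: (length only, length + Granuscore).
table2 = {
    "movie": (0.38, 0.46),
    "twitter": (0.24, 0.36),
    "yelp": (0.52, 0.56),
    "news": (0.45, 0.55),
}


def gains(length_only, length_plus_granuscore):
    """Return (absolute gain, relative gain in %) in explained deviance."""
    absolute = length_plus_granuscore - length_only
    relative = 100.0 * absolute / length_only
    return round(absolute, 2), round(relative, 1)


for domain, (base, full) in table2.items():
    print(domain, gains(base, full))
```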

## 5Applying Granuscore to QA Datasets

We apply Granuscore to several widely used QA datasets to investigate how granularity affects dataset properties and model performance\. We use the public splits of FACTS Parametric \(Cheng et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib23)\) \(1,047 samples\), SimpleQA \(Wei et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib22)\) \(4,255\), SQuAD \(Rajpurkar et al\., [2016](https://arxiv.org/html/2605.26620#bib.bib20)\) \(10,570\), and TruthfulQA \(Lin et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib21)\) \(817\), resulting in a total of 16,689 samples\.

To relate granularity to model behavior, we evaluate model correctness using Qwen3 0\.6B, Qwen3\-8B, and Qwen3\-32B \(Yang et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib24)\), Olmo 3 7B \(Olmo et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib47)\), and DeepSeek V3\.2 \(DeepSeek\-AI et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib2)\)\. These models cover a broad range of model sizes and represent well\-established open\-weight language models\. For Qwen3\-8B, we additionally evaluate both standard generation and reasoning\-enabled generation \(“think” mode\) to compare performance with and without explicit reasoning\.

Model responses are evaluated using GPT\-4\.1 nano as an LLM\-based judge, following the prompt template introduced in SimpleQA \(Wei et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib22)\)\. Further details are given in [Appendix L](https://arxiv.org/html/2605.26620#A12)\.

#### Granuscore Gold Answers

![Refer to caption](https://arxiv.org/html/2605.26620v1/x5.png)

Figure 5: Relationship between dataset\-level gold answer Granuscore and model correctness across QA benchmarks\. Higher Granuscore datasets are associated with higher correctness across models\. All pairwise differences in Granuscore between datasets are statistically significant \(Mann–Whitney $U$, $p \leq 1.1 \times 10^{-3}$\)\.

In [Figure 5](https://arxiv.org/html/2605.26620#S5.F5), we relate model correctness to the Granuscore of gold answers across QA datasets\. For each of the models, correctness varies strongly across datasets, with mean accuracies of 4\.99% on FACTS Parametric, 7\.80% on SimpleQA, 30\.24% on SQuAD, and 43\.47% on TruthfulQA\.

Datasets with lower Granuscores exhibit substantially lower accuracy, while higher Granuscore datasets are associated with improved performance\. This pattern is consistent across all evaluated models, suggesting that Granuscore captures a model\-independent aspect of question difficulty\. Larger models, such as DeepSeek V3\.2, achieve consistently higher correctness across datasets, indicating greater knowledge coverage, but follow the same overall trend\. In contrast, the smallest model, Qwen3 0\.6B, exhibits a weaker slope, likely reflecting general capacity limitations\. We observe a similar trend when analyzing the Granuscore of questions \([Appendix K](https://arxiv.org/html/2605.26620#A11)\)\. In contrast, potential confounding factors, including answer and question length, word frequency, and syntactic complexity, do not yield comparably consistent relationships with correctness \([Appendix K](https://arxiv.org/html/2605.26620#A11)\)\.

#### Granuscore Across Response Outcomes

| Type | Correct | Wrong | Not Att\. |
| --- | --- | --- | --- |
| Question | $70.1 \pm 0.1$ | $65.4 \pm 0.0$ | $67.2 \pm 0.0$ |
| Gold Answer | $59.4 \pm 0.4$ | $45.8 \pm 0.0$ | $48.7 \pm 0.2$ |
| Answer | $69.6 \pm 0.1$ | $66.0 \pm 0.1$ | $72.5 \pm 0.1$ |

Table 3: Granuscore \(mean $\pm$ std\. across models\) by response outcome \(Correct, Wrong, and Not Attempted\)\. Granuscore distributions differ significantly across outcomes \(Mann–Whitney $U$, $p \leq 2.42 \times 10^{-13}$\)\.

In [Table 3](https://arxiv.org/html/2605.26620#S5.T3), we report mean Granuscore values for questions, gold answers, and model outputs, stratified by response outcome \(correct, incorrect, and not\-attempted\)\. Model outputs associated with wrong answers exhibit the lowest average Granuscore \(66\.0\), followed by correct answers \(69\.6\), while not\-attempted responses show the highest output granularity \(72\.5\)\. The latter is expected, as abstentions typically consist of general statements indicating an inability to provide an answer\.

On the input side, incorrect responses are associated with the lowest Granuscore for both questions \(65\.4\) and gold answers \(45\.8\), followed by not\-attempted cases \(67\.2 and 48\.7\) and then correct responses \(70\.1 and 59\.4\)\.

Finally, we analyze the *granularity gap*, defined as the difference between model output and gold\-answer Granuscore\. The gap is substantially larger for incorrect and not\-attempted responses than for correct ones\. Using five\-fold cross\-validation, a logistic regression with granularity gap as the sole predictor achieves an average AUC of 0\.62 \($\pm 0.005$\), indicating a moderate, stable association between granularity mismatch and response failure\.
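To illustrate the granularity gap, the sketch below first subtracts the Table 3 gold\-answer means from the output means \(assuming a signed output\-minus\-gold difference; this gap of means only approximates the mean per\-sample gap\) and then computes a rank\-based AUC on a handful of made\-up per\-sample gaps\. It does not reproduce the cross\-validated logistic regression: it relies on the fact that a one\-predictor logistic model is monotone in its input, so with a positive fitted coefficient it ranks held\-out samples exactly as the raw gap does\. The names `table3_means`, `auc`, `toy_gaps`, and `toy_failed` are hypothetical\.

```python
# Mean Granuscore by outcome, copied from Table 3: (model output, gold answer).
table3_means = {
    "correct": (69.6, 59.4),
    "wrong": (66.0, 45.8),
    "not_attempted": (72.5, 48.7),
}

# Gap between the outcome-level means (model output minus gold answer).
mean_gap = {k: round(out - gold, 1) for k, (out, gold) in table3_means.items()}


def auc(scores, labels):
    """Rank-based AUC: probability that a failure (label 1) has a larger score
    than a success (label 0), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-sample gaps (NOT data from the paper).
# Label 1 marks a failed response (wrong or not attempted), 0 a correct one.
toy_gaps = [8.0, 12.5, 25.0, 9.0, 30.0, 11.0]
toy_failed = [0, 0, 1, 0, 1, 1]

print(mean_gap)
print(auc(toy_gaps, toy_failed))
```

At the level of means, the gap for wrong and not\-attempted responses is roughly twice that for correct ones, which is the direction the AUC analysis quantifies per sample\.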

## 6Discussion

#### Comparison of Methods

Our results on GRANOLA\-EQ highlight the importance of hierarchical structure for estimating granularity\. Methods based on HiT consistently outperform approaches using standard sentence embeddings, indicating that granularity is closely tied to hierarchical relations rather than surface similarity\. Importantly, the radial depth signal \(Dist0\) already outperforms several baselines without any training on GRANOLA\-EQ, indicating that the hierarchical embedding space itself captures meaningful granularity signals independent of the learned mapping\. However, anchor\-based comparisons further improve performance, particularly for global ordering\. Comparing entities to anchors across the embedding space provides additional relational context, enabling a more reliable and stable estimation of granularity across unrelated hierarchies\. Hence, the best results are achieved when combining both signals\.

This challenge of comparing unrelated hierarchies is also reflected in the evaluation metrics\. We observe a consistent gap between intra\-entry and global accuracy: intra\-entry comparisons operate within a shared semantic hierarchy \(e\.g\., city → state → country\), whereas global comparisons require ordering entities from unrelated hierarchies on a common scale\. Despite this difficulty, Granuscore provides a strong signal for estimating general granularity across heterogeneous hierarchies\.
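The difference between the two settings can be made concrete with a small sketch of pairwise ordering accuracy\. Everything here is an assumption for illustration: the entries, level indices, and scores are invented, level indices are treated as comparable across entries, a higher score is taken to mean coarser granularity, and the function is one plausible formulation rather than the benchmark's exact metric definition\.

```python
from itertools import combinations

# Hypothetical scored items: (entry_id, level, score). Level 0 is the finest
# level of an entry; a higher score is assumed to mean coarser granularity.
items = [
    ("geo", 0, 41.0), ("geo", 1, 55.0), ("geo", 2, 68.0),  # city, state, country
    ("bio", 0, 58.0), ("bio", 1, 63.0), ("bio", 2, 70.0),  # species, genus, family
]


def pairwise_accuracy(items, same_entry_only):
    """Fraction of item pairs with different levels whose scores are ordered
    the same way as their levels."""
    correct = total = 0
    for (e1, l1, s1), (e2, l2, s2) in combinations(items, 2):
        # Skip pairs without a defined order, and cross-entry pairs if requested.
        if l1 == l2 or (same_entry_only and e1 != e2):
            continue
        total += 1
        correct += (s1 < s2) == (l1 < l2)
    return correct / total


intra_entry_acc = pairwise_accuracy(items, same_entry_only=True)
global_acc = pairwise_accuracy(items, same_entry_only=False)
print(intra_entry_acc, global_acc)
```

In this toy data both hierarchies are ordered correctly internally, but the finest "bio" item scores above the middle "geo" item, so only the global comparison records an error\.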

#### Correlation to Sentence Specificity

Further, we show that Granuscore explains non\-linear variation in sentence specificity beyond sentence length, which serves as a strong baseline indicator \(Gao et al\., [2019](https://arxiv.org/html/2605.26620#bib.bib13); Ko et al\., [2019](https://arxiv.org/html/2605.26620#bib.bib16)\)\. The consistency of this relationship across heterogeneous domains supports the robustness of Granuscore as a general granularity measure\.

#### Granularity and QA Performance

Our QA analysis case study reveals consistent patterns linking granularity and response outcomes\.

First, across all models and 16,689 QA\-samples, we observe clear differences in the Granuscore across response outcomes\. Questions and gold answers associated with incorrect or not\-attempted responses exhibit significantly lower Granuscore values than those associated with correct responses\. The effect is particularly pronounced for gold answers, while the difference in question granularity is present but weaker\. These findings suggest that Granuscore may serve as a proxy for the difficulty of question–answer pairs and could be incorporated as a signal for deciding when to rely on a model’s internal knowledge versus external tools\.

At the dataset level, we observe a complementary trend: datasets with lower granularity, both for gold answers and questions, are substantially harder for models\. This suggests that granularity provides a useful lens for characterizing differences in QA\-difficulty that are not explained by superficial properties such as answer or question length\.

Finally, we observe that incorrect responses tend to exhibit lower output granularity than correct ones\. In these cases, models often remain at the level of detail implied by the question rather than adapting their responses to a more appropriate granularity based on their confidence in the answer\. This aligns with findings that models struggle to adjust answer granularity \(Yona et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib4)\) and with the argument of Kalai et al\. \([2025](https://arxiv.org/html/2605.26620#bib.bib52)\) that benchmark evaluations incentivize models to guess overly specific answers\.

#### Future Directions

A natural next step is to use Granuscore as a training signal for language models\. Prior work has shown that optimizing for properties such as informativeness and interestingness can improve response quality and user engagement \(Adiwardana et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib7); Thoppilan et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib30); Onozeki and Inaba, [2025](https://arxiv.org/html/2605.26620#bib.bib6)\)\. Similarly, Granuscore could encourage models to generate responses at appropriate levels of granularity\. In particular, models could learn to align output granularity with their confidence: when uncertain about fine\-grained details, they may respond at a coarser but reliable level \(e\.g\., a broader category or time period\)\. Such behavior mirrors human communication and could help reduce overly fine\-grained incorrect answers while preserving informative responses\.

Beyond response generation, Granuscore may also support controlled language adaptation, such as simplification, including summarization \(Stoll et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib40)\) and definition generation \(Ellinger et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib31)\), where appropriate granularity is crucial for producing accessible yet informative text\.

## 7Conclusion

We introduced Granuscore, a reference\-free measure that quantifies the granularity expressed in text using a hierarchical embedding space\. Granuscore reliably recovers granularity orderings on the controlled GRANOLA\-EQ benchmark, aligns with expected differences across scientific paper sections, and captures non\-linear variation in sentence specificity beyond sentence length\.

Applied to question answering, Granuscore provides a useful lens for characterizing dataset difficulty and understanding differences in model performance and outputs\.

## Limitations

#### Dependence on WordNet Hierarchy\.

Granuscore relies on a single hierarchical embedding model fine\-tuned on the WordNet hierarchy\. We choose this variant because WordNet provides broad\-coverage, general\-purpose commonsense structure\. This choice might limit granularity estimation in domains that are poorly represented in WordNet\. Future work could explore domain\-specific hierarchical models and evaluate their impact, whereas we intentionally focus on general applicability in this work\.

#### Human Perception of Granularity\.

While GRANOLA\-EQ is manually validated by human annotators, our evaluation does not include a dedicated human study directly comparing Granuscore values against explicit human granularity judgments\. Instead, we focus on broad empirical validation across multiple complementary settings, including hierarchical ordering, discourse\-level analyses, sentence specificity, and downstream QA behavior\.

## Acknowledgments

All analysis, research, and ideas are either our own or cited\. This work used LLM\-based tools for language edits and clarity improvements\. This research has been funded by the German Federal Ministry of Research, Technology, and Space \(BMFTR\) through grant 01IS23069 Software Campus 3\.0 \(Technical University of Munich\) as part of the Software Campus project “Know ELViS”\.

## References

- D\. Adiwardana, M\. Luong, D\. R\. So, J\. Hall, N\. Fiedel, R\. Thoppilan, Z\. Yang, A\. Kulshreshtha, G\. Nemade, Y\. Lu, and Q\. V\. Le \(2020\) Towards a Human\-like Open\-Domain Chatbot\. External Links: [Link](https://arxiv.org/abs/2001.09977v3)\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p3.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px4.p1.1), [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px4.p1.1)\.
- T\. Akiba, S\. Sano, T\. Yanase, T\. Ohta, and M\. Koyama \(2019\) Optuna: A Next\-Generation Hyperparameter Optimization Framework\. In The 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp\. 2623–2631\. Cited by: [Appendix E](https://arxiv.org/html/2605.26620#A5.p1.1)\.
- M\. Anschütz, T\. M\. Pham, E\. Nasrallah, M\. Müller, C\. Craciun, and G\. Groh \(2025\) German4All – A Dataset and Model for Readability\-Controlled Paraphrasing in German\. In Proceedings of the 18th International Natural Language Generation Conference, L\. Flek, S\. Narayan, L\. H\. Phương, and J\. Pei \(Eds\.\), Hanoi, Vietnam, pp\. 390–407\. External Links: [Link](https://aclanthology.org/2025.inlg-main.24/)\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p3.1)\.
- J\. Chen, Y\. He, I\. Horrocks, and Z\. Yuan \(2024\) Language Models as Hierarchy Encoders\. In Advances in Neural Information Processing Systems 37, Vancouver, BC, Canada, pp\. 14690–14711\. External Links: ISBN 979\-8\-3313\-1438\-5, [Link](http://www.proceedings.com/079017-0469.html), [Document](https://dx.doi.org/10.52202/079017-0469)\. Cited by: [Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.2.1), [§3](https://arxiv.org/html/2605.26620#S3.p1.1)\.
- A\. Cheng, A\. Jacovi, A\. Globerson, B\. Golan, C\. Kwong, C\. Alberti, C\. Tao, E\. Ben\-David, G\. S\. Tomar, L\. Haas, Y\. Bitton, A\. Bloniarz, A\. Bai, A\. Wang, A\. Siddiqui, A\. B\. Castillo, A\. Atias, C\. Liu, C\. Fry, D\. Balle, D\. Ghosal, D\. Kukliansky, D\. Marcus, E\. Gribovskaya, E\. Ofek, H\. Zhuang, I\. Laish, J\. Ackermann, L\. Wang, M\. Risdal, M\. Barnes, M\. Fink, M\. Amin, M\. Ambar, N\. Potikha, N\. Gupta, N\. Katz, N\. Velan, O\. Roval, O\. Ram, P\. Zablotskaia, P\. Bang, P\. Agrawal, R\. Ghiya, S\. Ganapathy, S\. Baumgartner, S\. Erell, S\. Prakash, T\. Sellam, V\. Rao, X\. Wang, Y\. Akulov, Y\. Yang, Z\. Yang, Z\. Lai, Z\. Wu, A\. Dragan, A\. Hassidim, F\. Pereira, S\. Petrov, S\. Venkatachary, T\. Doshi, Y\. Matias, S\. Goldshtein, and D\. Das \(2025\) The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality\. arXiv\. Note: arXiv:2512\.10791 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2512.10791), [Document](https://dx.doi.org/10.48550/arXiv.2512.10791)\. Cited by: [§5](https://arxiv.org/html/2605.26620#S5.p1.1)\.
- J\. Cohen \(2013\) Statistical Power Analysis for the Behavioral Sciences\. 2nd edition, Routledge, New York\. External Links: ISBN 978\-0\-203\-77158\-7, [Document](https://dx.doi.org/10.4324/9780203771587)\. Cited by: [§4\.2](https://arxiv.org/html/2605.26620#S4.SS2.p1.6)\.
- R\. A\. Day and B\. Gastel \(2012\) How to write and publish a scientific paper\. Cambridge University Press\. External Links: ISBN 9781107670747, [Link](https://books.google.de/books?id=h0oWR3_cVrMC)\. Cited by: [§3\.4](https://arxiv.org/html/2605.26620#S3.SS4.SSS0.Px2.p1.1)\.
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Zhang, H\. Ding, H\. Xin, H\. Gao, H\. Li, H\. Qu, J\. L\. Cai, J\. Liang, J\. Guo, J\. Ni, J\. Li, J\. Wang, J\. Chen, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, L\. Zhao, L\. Wang, L\. Zhang, M\. Li, M\. Wang, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, N\. Tian, P\. Huang, P\. Wang, P\. Zhang, Q\. Wang, Q\. Zhu, Q\. Chen, Q\. Du, R\. J\. Chen, R\. L\. Jin, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Xu, R\. Zhang, R\. Chen, S\. S\. Li, S\. Lu, S\. Zhou, S\. Chen, S\. Wu, S\. Ye, S\. Ye, S\. Ma, S\. Wang, S\. Zhou, S\. Yu, S\. Zhou, S\. Pan, T\. Wang, T\. Yun, T\. Pei, T\. Sun, W\. L\. Xiao, W\. Zeng, W\. Zhao, W\. An, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, X\. Q\. Li, X\. Jin, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Shen, X\. Chen, X\. Zhang, X\. Chen, X\. Nie, X\. Sun, X\. Wang, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Song, X\. Shan, X\. Zhou, X\. Yang, X\. Li, X\. Su, X\. Lin, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. X\. Zhu, Y\. Zhang, Y\. Xu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Li, Y\. Wang, Y\. Yu, Y\. Zheng, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Tang, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Wu, Y\. Ou, Y\. Zhu, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Zha, Y\. Xiong, Y\. Ma, Y\. Yan, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Huang, Z\. Zhang, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Xu, Z\. Wu, Z\. Zhang, Z\. Li, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Gao, and Z\. Pan \(2025\) DeepSeek\-V3 Technical Report\. arXiv\. Note: arXiv:2412\.19437 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2412.19437), [Document](https://dx.doi.org/10.48550/arXiv.2412.19437)\. Cited by: [Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.11.1), [§5](https://arxiv.org/html/2605.26620#S5.p2.1)\.
- P\. Dixon \(1987\) The processing of organizational and component step information in written directions\. Journal of Memory and Language 26\(1\), pp\. 24–35\. External Links: ISSN 0749\-596X, [Link](https://www.sciencedirect.com/science/article/pii/0749596X8790060X), [Document](https://dx.doi.org/https%3A//doi.org/10.1016/0749-596X%2887%2990060-X)\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Djalali, D\. Clausen, S\. Lauer, K\. Schultz, and C\. Potts \(2011\) Modeling Expert Effects and Common Ground Using Questions under Discussion\. In AAAI Fall Symposium: Building Representations of Common Ground with Intelligent Agents\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Douze, A\. Guzhva, C\. Deng, J\. Johnson, G\. Szilvasy, P\. Mazaré, M\. Lomeli, L\. Hosseini, and H\. Jégou \(2025\) The Faiss library\. arXiv\. Note: arXiv:2401\.08281 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2401.08281), [Document](https://dx.doi.org/10.48550/arXiv.2401.08281)\. Cited by: [Appendix C](https://arxiv.org/html/2605.26620#A3.p2.1)\.
- L\. Ellinger, M\. Anschütz, and G\. Groh \(2025\) Simplifications Are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM\-Generated Definitions\. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing \- Natural Language Processing in the Generative AI era, Varna, Bulgaria, pp\. 342–351\. External Links: [Link](https://aclanthology.org/2025.ranlp-1.42)\. Cited by: [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px4.p2.1)\.
- Y\. Gao, Y\. Zhong, D\. Preoţiuc\-Pietro, and J\. J\. Li \(2019\) Predicting and Analyzing Language Specificity in Social Media Posts\. Proceedings of the AAAI Conference on Artificial Intelligence 33\(01\), pp\. 6415–6422\. External Links: ISSN 2374\-3468, [Link](https://ojs.aaai.org/index.php/AAAI/article/view/4605), [Document](https://dx.doi.org/10.1609/aaai.v33i01.33016415)\. Cited by: [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px2.p1.1)\.
- J\. R\. Hobbs \(1985\) Granularity\. In Proceedings of the 9th International Joint Conference on Artificial Intelligence \- Volume 1, IJCAI’85, San Francisco, CA, USA, pp\. 432–435\. External Links: ISBN 0\-934613\-02\-8\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p1.1), [§1](https://arxiv.org/html/2605.26620#S1.p2.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Honnibal, I\. Montani, S\. Van Landeghem, and A\. Boyd \(2020\) spaCy: Industrial\-strength Natural Language Processing in Python\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.1212303)\. Cited by: [Appendix K](https://arxiv.org/html/2605.26620#A11.SS0.SSS0.Px2.p1.1), [§3\.2](https://arxiv.org/html/2605.26620#S3.SS2.p1.1)\.
- J\. Huang, K\. C\. Chang, J\. Xiong, and W\. Hwu \(2023\) Can Language Models Be Specific? How?\. In Findings of the Association for Computational Linguistics: ACL 2023, A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\), Toronto, Canada, pp\. 716–727\. External Links: [Link](https://aclanthology.org/2023.findings-acl.45/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.45)\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p2.1)\.
- A\. T\. Kalai, O\. Nachum, S\. S\. Vempala, and E\. Zhang \(2025\) Why Language Models Hallucinate\. arXiv\. Note: arXiv:2509\.04664 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2509.04664), [Document](https://dx.doi.org/10.48550/arXiv.2509.04664)\. Cited by: [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px3.p4.1)\.
- G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu \(2017\) LightGBM: A Highly Efficient Gradient Boosting Decision Tree\. In Advances in Neural Information Processing Systems, Vol\. 30\. External Links: [Link](https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html)\. Cited by: [§3](https://arxiv.org/html/2605.26620#S3.p3.1)\.
- G\. Kim \(2006\) Relationship between index term specificity and relevance judgment\. Information Processing & Management 42\(5\), pp\. 1218–1229\. External Links: ISSN 03064573, [Link](https://linkinghub.elsevier.com/retrieve/pii/S0306457306000057), [Document](https://dx.doi.org/10.1016/j.ipm.2005.12.004)\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px1.p2.1)\.
- W\. Ko, G\. Durrett, and J\. J\. Li \(2019\) Domain Agnostic Real\-Valued Specificity Prediction\. Proceedings of the AAAI Conference on Artificial Intelligence 33\(01\), pp\. 6610–6617\. External Links: ISSN 2374\-3468, [Link](https://ojs.aaai.org/index.php/AAAI/article/view/4630), [Document](https://dx.doi.org/10.1609/aaai.v33i01.33016610)\. Cited by: [Appendix I](https://arxiv.org/html/2605.26620#A9.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1), [§3\.4](https://arxiv.org/html/2605.26620#S3.SS4.SSS0.Px3.p1.1), [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px2.p1.1)\.
- J\. J\. Li, B\. O’Daniel, Y\. Wu, W\. Zhao, and A\. Nenkova \(2016\) Improving the Annotation of Sentence Specificity\. In Proceedings of the Tenth International Conference on Language Resources and Evaluation \(LREC’16\), N\. Calzolari, K\. Choukri, T\. Declerck, S\. Goggi, M\. Grobelnik, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\), Portorož, Slovenia, pp\. 3921–3927\. External Links: [Link](https://aclanthology.org/L16-1620/)\. Cited by: [Appendix I](https://arxiv.org/html/2605.26620#A9.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1), [§3\.4](https://arxiv.org/html/2605.26620#S3.SS4.SSS0.Px3.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: Measuring How Models Mimic Human Falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\), Dublin, Ireland, pp\. 3214–3252\. External Links: [Link](https://aclanthology.org/2022.acl-long.229/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229)\. Cited by: [§5](https://arxiv.org/html/2605.26620#S5.p1.1)\.
- K\. Lo, L\. L\. Wang, M\. Neumann, R\. Kinney, and D\. Weld \(2020\) S2ORC: The Semantic Scholar Open Research Corpus\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\), Online, pp\. 4969–4983\. External Links: [Link](https://aclanthology.org/2020.acl-main.447/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.447)\. Cited by: [§3\.4](https://arxiv.org/html/2605.26620#S3.SS4.SSS0.Px2.p2.1)\.
- G\. A\. Miller \(1994\) WordNet: A Lexical Database for English\. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8\-11, 1994\. External Links: [Link](https://aclanthology.org/H94-1111/)\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1), [§3](https://arxiv.org/html/2605.26620#S3.p1.1)\.
- R\. Mulkar\-Mehta, J\. Hobbs, and E\. Hovy \(2011\) Granularity in Natural Language Discourse\. In Proceedings of the Ninth International Conference on Computational Semantics \(IWCS 2011\), J\. Bos and S\. Pulman \(Eds\.\)\. External Links: [Link](https://aclanthology.org/W11-0143/)\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p1.1), [§1](https://arxiv.org/html/2605.26620#S1.p2.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px1.p1.1)\.
- OECD \(2024\) Do Adults Have the Skills They Need to Thrive in a Changing World?: Survey of Adult Skills 2023\. OECD Skills Studies\. External Links: [Link](https://www.oecd.org/en/publications/do-adults-have-the-skills-they-need-to-thrive-in-a-changing-world_b263dc5d-en.html), [Document](https://dx.doi.org/10.1787/b263dc5d-en)\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p3.1)\.
- T\. Olmo, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison, J\. Morrison, J\. Poznanski, K\. Lo, L\. Soldaini, M\. Jordan, M\. Chen, M\. Noukhovitch, N\. Lambert, P\. Walsh, P\. Dasigi, R\. Berry, S\. Malik, S\. Shah, S\. Geng, S\. Arora, S\. Gupta, T\. Anderson, T\. Xiao, T\. Murray, T\. Romero, V\. Graf, A\. Asai, A\. Bhagia, A\. Wettig, A\. Liu, A\. Rangapur, C\. Anastasiades, C\. Huang, D\. Schwenk, H\. Trivedi, I\. Magnusson, J\. Lochner, J\. Liu, L\. J\. V\. Miranda, M\. Sap, M\. Morgan, M\. Schmitz, M\. Guerquin, M\. Wilson, R\. Huff, R\. L\. Bras, R\. Xin, R\. Shao, S\. Skjonsberg, S\. Z\. Shen, S\. S\. Li, T\. Wilde, V\. Pyatkin, W\. Merrill, Y\. Chang, Y\. Gu, Z\. Zeng, A\. Sabharwal, L\. Zettlemoyer, P\. W\. Koh, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2025\) Olmo 3\. arXiv\. Note: arXiv:2512\.13961 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2512.13961), [Document](https://dx.doi.org/10.48550/arXiv.2512.13961)\. Cited by: [Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.7.1), [§5](https://arxiv.org/html/2605.26620#S5.p2.1)\.
- H\. Onozeki and M\. Inaba \(2025\) Enhancing Coherence and Interestingness in Knowledge\-Grounded Dialogue Generation\. In Proceedings of the 18th International Natural Language Generation Conference, L\. Flek, S\. Narayan, L\. H\. Phương, and J\. Pei \(Eds\.\), Hanoi, Vietnam, pp\. 1–19\. External Links: [Link](https://aclanthology.org/2025.inlg-main.1/)\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px4.p1.1), [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px4.p1.1)\.
- OpenAI \(2025\) Introducing GPT‑4\.1 in the API\. Note: [https://openai\.com/index/gpt\-4\-1/](https://openai.com/index/gpt-4-1/)\. Accessed: 2026\-05\-12\. Cited by: [Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.4.1), [Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.5.1)\.
- P\. Rajpurkar, J\. Zhang, K\. Lopyrev, and P\. Liang \(2016\) SQuAD: 100,000\+ Questions for Machine Comprehension of Text\. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, J\. Su, K\. Duh, and X\. Carreras \(Eds\.\), Austin, Texas, pp\. 2383–2392\. External Links: [Link](https://aclanthology.org/D16-1264/), [Document](https://dx.doi.org/10.18653/v1/D16-1264)\. Cited by: [§5](https://arxiv.org/html/2605.26620#S5.p1.1)\.
- E\. Rosch, C\. B\. Mervis, W\. D\. Gray, D\. M\. Johnson, and P\. Boyes\-Braem \(1976\) Basic objects in natural categories\. Cognitive Psychology 8\(3\), pp\. 382–439\. External Links: ISSN 0010\-0285, [Link](https://www.sciencedirect.com/science/article/pii/001002857690013X), [Document](https://dx.doi.org/10.1016/0010-0285%2876%2990013-X)\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Sciavolino, Z\. Zhong, J\. Lee, and D\. Chen \(2021\) Simple Entity\-Centric Questions Challenge Dense Retrievers\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\), Online and Punta Cana, Dominican Republic, pp\. 6138–6148\. External Links: [Link](https://aclanthology.org/2021.emnlp-main.496/), [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.496)\. Cited by: [§3\.1](https://arxiv.org/html/2605.26620#S3.SS1.p1.1)\.
- R\. Speer, J\. Chin, A\. Lin, S\. Jewett, and L\. Nathan \(2022\) Rspeer/wordfreq: v3\.0 \(v3\.0\.2\)\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.7199437), [Link](https://doi.org/10.5281/zenodo.7199437)\. Cited by: [Appendix K](https://arxiv.org/html/2605.26620#A11.SS0.SSS0.Px2.p1.1)\.
- M\. Stoll, M\. Kerwer, K\. Lieb, and A\. Chasiotis \(2022\) Plain language summaries: A systematic review of theory, guidelines and empirical research\. PLOS ONE 17\(6\), pp\. e0268789\. External Links: ISSN 1932\-6203, [Link](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0268789), [Document](https://dx.doi.org/10.1371/journal.pone.0268789)\. Cited by: [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px4.p2.1)\.
- J\. M\. Swales \(1990\) Genre analysis\. Cambridge Applied Linguistics, Cambridge University Press\. External Links: ISBN 9780521338134, LCCN gb90024456, [Link](https://books.google.de/books?id=shX_EV1r3-0C)\. Cited by: [§3\.4](https://arxiv.org/html/2605.26620#S3.SS4.SSS0.Px2.p1.1)\.
- R\. Thoppilan, D\. D\. Freitas, J\. Hall, N\. Shazeer, A\. Kulshreshtha, H\. Cheng, A\. Jin, T\. Bos, L\. Baker, Y\. Du, Y\. Li, H\. Lee, H\. S\. Zheng, A\. Ghafouri, M\. Menegali, Y\. Huang, M\. Krikun, D\. Lepikhin, J\. Qin, D\. Chen, Y\. Xu, Z\. Chen, A\. Roberts, M\. Bosma, V\. Zhao, Y\. Zhou, C\. Chang, I\. Krivokon, W\. Rusch, M\. Pickett, P\. Srinivasan, L\. Man, K\. Meier\-Hellstern, M\. R\. Morris, T\. Doshi, R\. D\. Santos, T\. Duke, J\. Soraker, B\. Zevenbergen, V\. Prabhakaran, M\. Diaz, B\. Hutchinson, K\. Olson, A\. Molina, E\. Hoffman\-John, J\. Lee, L\. Aroyo, R\. Rajakumar, A\. Butryna, M\. Lamm, V\. Kuzmina, J\. Fenton, A\. Cohen, R\. Bernstein, R\. Kurzweil, B\. Aguera\-Arcas, C\. Cui, M\. Croak, E\. Chi, and Q\. Le \(2022\) LaMDA: Language Models for Dialog Applications\. arXiv\. Note: arXiv:2201\.08239 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2201.08239), [Document](https://dx.doi.org/10.48550/arXiv.2201.08239)\. Cited by: [§1](https://arxiv.org/html/2605.26620#S1.p3.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1), [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px4.p1.1), [§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px4.p1.1)\.
- D\. Vrandečić and M\. Krötzsch \(2014\) Wikidata: a free collaborative knowledgebase\. Commun\. ACM 57\(10\), pp\. 78–85\. External Links: ISSN 0001\-0782, [Link](https://dl.acm.org/doi/10.1145/2629489), [Document](https://dx.doi.org/10.1145/2629489)\. Cited by: [§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p1.1)\.
- W\. Wang, F\. Wei, L\. Dong, H\. Bao, N\. Yang, and M\. Zhou \(2020\) MiniLM: deep self\-attention distillation for task\-agnostic compression of pre\-trained transformers\. In Advances in Neural Information Processing Systems, H\. Larochelle, M\. Ranzato, R\. Hadsell, M\. F\. Balcan, and H\. Lin \(Eds\.\), Vol\. 33, pp\. 5776–5788\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)\. Cited by: [Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.3.1)\.
- M\. Wanner, B\. V\. Durme, and M\. Dredze \(2024\) DnDScore: Decontextualization and Decomposition for Factuality Verification in Long\-Form Text Generation\. arXiv\. Note: arXiv:2412\.13175 \[cs\]\. External Links: [Link](http://arxiv.org/abs/2412.13175), [Document](https://dx.doi.org/10.48550/arXiv.2412.13175)\. Cited by: [§3\.2](https://arxiv.org/html/2605.26620#S3.SS2.p2.1)\.
- J\. Wei, N\. Karina, H\. W\. Chung, Y\. J\. Jiao, S\. Papay, A\. Glaese, J\. Schulman, and W\. Fedus \(2024\)Measuring short\-form factuality in large language models\.arXiv\.Note:arXiv:2411\.04368 \[cs\]Comment: Blog post: https://openai\.com/index/introducing\-simpleqa/External Links:[Link](http://arxiv.org/abs/2411.04368),[Document](https://dx.doi.org/10.48550/arXiv.2411.04368)Cited by:[Appendix L](https://arxiv.org/html/2605.26620#A12.p4.1),[§5](https://arxiv.org/html/2605.26620#S5.p1.1),[§5](https://arxiv.org/html/2605.26620#S5.p3.1)\.
- T\. Wu, J\. Ni, B\. Hooi, J\. Zhang, E\. Ash, S\. Ng, M\. Sachan, and M\. Leippold \(2025\)Balancing Truthfulness and Informativeness with Uncertainty\-Aware Instruction Fine\-Tuning\.arXiv\.Note:arXiv:2502\.11962 \[cs\]External Links:[Link](http://arxiv.org/abs/2502.11962),[Document](https://dx.doi.org/10.48550/arXiv.2502.11962)Cited by:[§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 Technical Report\.arXiv\.Note:arXiv:2505\.09388 \[cs\]External Links:[Link](http://arxiv.org/abs/2505.09388),[Document](https://dx.doi.org/10.48550/arXiv.2505.09388)Cited by:[Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.10.1),[Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.6.1),[Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.8.1),[Table 4](https://arxiv.org/html/2605.26620#A2.T4.1.9.1),[§5](https://arxiv.org/html/2605.26620#S5.p2.1)\.
- G\. Yona, R\. Aharoni, and M\. Geva \(2024\)Narrowing the Knowledge Evaluation Gap: Open\-Domain Question Answering with Multi\-Granularity Answers\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 6737–6751\.External Links:[Link](https://aclanthology.org/2024.acl-long.365/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.365)Cited by:[2nd item](https://arxiv.org/html/2605.26620#S1.I1.i2.p1.1),[§1](https://arxiv.org/html/2605.26620#S1.p2.1),[§2](https://arxiv.org/html/2605.26620#S2.SS0.SSS0.Px3.p3.1),[§3\.1](https://arxiv.org/html/2605.26620#S3.SS1.p1.1),[§6](https://arxiv.org/html/2605.26620#S6.SS0.SSS0.Px3.p4.1)\.

## Appendix A Example Sentences

We present example sentences with controlled granularity levels together with their assigned Granuscore values in [Figure 6](https://arxiv.org/html/2605.26620#A1.F6), [Figure 7](https://arxiv.org/html/2605.26620#A1.F7), and [Figure 8](https://arxiv.org/html/2605.26620#A1.F8)\. They illustrate how the same underlying fact can be expressed at different levels of granularity\.

![Refer to caption](https://arxiv.org/html/2605.26620v1/x6.png)

Figure 6: Illustration of semantic abstraction\. Starting from the specific statement “He fixed his CUBE road bike using a rusty wrench”, it can be generalized by abstracting the vehicle and the instrument\.

![Refer to caption](https://arxiv.org/html/2605.26620v1/x7.png)

Figure 7: Illustration of semantic abstraction\. Starting from the specific statement “I bought a cappuccino at the small Italian café”, it can be generalized by abstracting the drink type and the venue\.

![Refer to caption](https://arxiv.org/html/2605.26620v1/x8.png)

Figure 8: Illustration of semantic abstraction\. Starting from the specific statement “He sits on his old wooden chair”, it can be generalized by abstracting the seating option\.

## Appendix B Model Access

To support reproducibility, [Table 4](https://arxiv.org/html/2605.26620#A2.T4) lists all models used in this paper, including their names, exact versions, and access providers\.

| Name | Version | Access Provider |
| --- | --- | --- |
| HiT \(Chen et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib25)\) | HiT\-MiniLM\-L12\-WordNetNoun | Local |
| MiniLM \(Wang et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib61)\) | [all\-MiniLM\-L6\-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | Local |
| GPT\-4\.1 nano \(OpenAI, [2025](https://arxiv.org/html/2605.26620#bib.bib60)\) | gpt\-4\.1\-nano\-2025\-04\-14 | OpenAI Batch API |
| GPT\-4\.1 mini \(OpenAI, [2025](https://arxiv.org/html/2605.26620#bib.bib60)\) | gpt\-4\.1\-mini\-2025\-04\-14 | OpenAI Batch API |
| Qwen3 0\.6B \(Yang et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib24)\) | N/A | Local |
| Olmo 3 \(Olmo et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib47)\) | Olmo\-3\-7B\-Instruct | Local |
| Qwen3 8B \(Yang et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib24)\) | N/A | Local |
| Qwen3 8B Think \(Yang et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib24)\) | N/A | Local |
| Qwen3 32B \(Yang et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib24)\) | N/A | OpenRouter |
| DeepSeek V3\.2 \(DeepSeek\-AI et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib2)\) | N/A | OpenRouter |

Table 4: Specific model versions used in our experiments\. For each model we provide the exact version and the access provider\.

## Appendix C Reference Construction

As the embedding space is unbounded, we approximate its structure using a finite subset of entities\. We construct this proxy space from 50,000 randomly sampled Wikidata entities \([https://huggingface\.co/datasets/philippesaade/wikidata](https://huggingface.co/datasets/philippesaade/wikidata)\)\. For each entity, we use its title as the textual representation\. Wikidata offers broad topical coverage and a relatively clean entity structure, making it a suitable general\-purpose semantic reference\. We choose an index size of 50,000 entities as a trade\-off between computational efficiency and neighborhood fidelity\. In preliminary experiments, this size yielded stable neighborhood structures, while larger indices substantially increased runtime \(cf\. [Appendix G](https://arxiv.org/html/2605.26620#A7)\)\.

All entity embeddings are indexed using FAISS \(Douze et al\., [2025](https://arxiv.org/html/2605.26620#bib.bib26)\)\. For the Random Anchor and Radial Anchor methods, anchors are sampled once in advance from the index and reused for all queries\. For the Nearest Neighbor method, we retrieve the top nearest neighbors using cosine similarity for each query individually\. For the Random Neighbor baseline, neighbors are sampled at random from the index for each query\.
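The difference between per-query retrieval and anchors that are sampled once can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: it replaces the FAISS index over 50,000 embeddings with a plain Python list and brute-force cosine similarity, and the function names, toy vectors, and `k` values are hypothetical. The Radial Anchor variant is left out because its construction is not described in this appendix.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbors(query, index, k):
    """Nearest Neighbor: retrieved separately for every query."""
    ranked = sorted(range(len(index)), key=lambda i: cosine(query, index[i]), reverse=True)
    return ranked[:k]

def random_neighbors(index, k, rng):
    """Random Neighbor baseline: a fresh random sample for every query."""
    return rng.sample(range(len(index)), k)

def sample_anchors(index, k, seed=0):
    """Random Anchor: sampled once in advance, then reused for all queries."""
    return random.Random(seed).sample(range(len(index)), k)

# Toy 2-d "index" standing in for the entity embeddings (hypothetical values).
index = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
anchors = sample_anchors(index, k=2)      # fixed reference set
query = [1.0, 0.05]
print(nearest_neighbors(query, index, k=2))
print(anchors)
```

Fixing the anchors up front means every query is scored against the same reference set, whereas the neighbor-based variants build a new reference set per query.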

## Appendix D LLM Prompt

We use the following few\-shot prompt to annotate granularity levels with an LLM\. The prompt includes three example semantic hierarchies comprising a total of 14 realizations together with their expected granularity levels\.

User Prompt: LLM Granularity

You are an expert annotator for granularity\. Your task is to assign a granularity score to an answer using a 4\-point Likert scale\. Granularity refers to how fine\- / coarse\-grained an answer is\. Always assign exactly one score: 1, 2, 3, or 4\. \-\-\- \{examples\} \-\-\- Now assign a granularity score\. Output only the score\. Answer: "\{answer\}" Granularity:

## Appendix E LightGBM Model Training

We train Granuscore using a LightGBM regression model on Granola\-EQ\. Hyperparameters are selected with Optuna \(Akiba et al\., [2019](https://arxiv.org/html/2605.26620#bib.bib37)\), using 50 optimization trials on the development split\. The final hyperparameter configuration is shown in [Table 5](https://arxiv.org/html/2605.26620#A5.T5)\.

| Parameter | Value |
| --- | --- |
| Boosting type | GBDT |
| Objective | Regression |
| Evaluation metric | RMSE |
| Early stopping | 200 |
| Number of iterations | 10,000 |
| Learning rate | 0\.0257596 |
| Number of leaves | 138 |
| Maximum depth | Unlimited |
| Minimum data in leaf | 57 |
| Feature fraction | 0\.751449 |
| Bagging fraction | 0\.638041 |
| Bagging frequency | 7 |
| Dropout rate \(DART\) | 0\.1 |
| Maximum bins | 255 |
| Device | CPU |

Table 5: LightGBM hyperparameters used for training Granuscore\.
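For readers who want to reuse the configuration, the values in Table 5 can be written as a LightGBM parameter dictionary. The mapping from the table's row labels to LightGBM parameter names is our assumption, as are the variable names; the numbers are copied from the table.

```python
# Table 5 expressed with LightGBM parameter names (the name mapping is an assumption).
LGBM_PARAMS = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": "rmse",
    "learning_rate": 0.0257596,
    "num_leaves": 138,
    "max_depth": -1,            # LightGBM's value for "no depth limit"
    "min_data_in_leaf": 57,
    "feature_fraction": 0.751449,
    "bagging_fraction": 0.638041,
    "bagging_freq": 7,
    "drop_rate": 0.1,           # only consulted by DART boosting
    "max_bin": 255,
    "device_type": "cpu",
}
NUM_BOOST_ROUND = 10_000        # "Number of iterations"
EARLY_STOPPING_ROUNDS = 200     # "Early stopping"

print(len(LGBM_PARAMS), "parameters")
```

A training call would then look roughly like `lightgbm.train(LGBM_PARAMS, train_set, num_boost_round=NUM_BOOST_ROUND, valid_sets=[dev_set], callbacks=[lightgbm.early_stopping(EARLY_STOPPING_ROUNDS)])`, where `train_set` and `dev_set` are hypothetical `lightgbm.Dataset` objects built from the Granola-EQ splits.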
## Appendix F Granola\-EQ

| Rel\. | Question Template |
| --- | --- |
| P112 | Who founded \[X\]? |
| P127 | Who owns \[X\]? |
| P131 | Where is \[X\] located? |
| P159 | Where is the headquarter of \[X\]? |
| P170 | Who created \[X\]? |
| P175 | Who performed \[X\]? |
| P176 | Which company is \[X\] produced by? |
| P19 | Where was \[X\] born? |
| P20 | Where did \[X\] die? |
| P26 | Who is \[X\] married to? |
| P264 | What music label represents \[X\]? |
| P276 | Where is \[X\] located? |
| P40 | Who is \[X\]’s child? |
| P50 | Who is the author of \[X\]? |
| P69 | Where was \[X\] educated? |
| P740 | Where was \[X\] founded? |

Table 6: Question template for each relation type in the dataset\.

| Rel\. | Train | Dev | Test |
| --- | --- | --- | --- |
| P20 | 650 | 36 | 49 |
| P19 | 635 | 45 | 64 |
| P69 | 600 | 48 | 30 |
| P276 | 585 | 28 | 55 |
| P159 | 496 | 76 | 67 |
| P26 | 482 | 58 | 142 |
| P131 | 452 | 96 | 61 |
| P176 | 444 | 73 | 124 |
| P50 | 443 | 64 | 104 |
| P170 | 439 | 105 | 56 |
| P264 | 398 | 68 | 95 |
| P127 | 361 | 83 | 147 |
| P40 | 337 | 83 | 127 |
| P112 | 235 | 38 | 57 |
| P175 | 79 | 315 | 35 |
| P740 | 66 | 4 | 7 |
| Total | 6702 | 1220 | 1220 |

Table 7: Distribution of relation types across the training, development, and test splits of Granola\-EQ\.

Granola\-EQ covers multiple relation types, represented through masked question templates\. [Table 6](https://arxiv.org/html/2605.26620#A6.T6) lists the relation categories included in the dataset together with their corresponding question templates\. [Table 7](https://arxiv.org/html/2605.26620#A6.T7) reports the distribution of relation types across the training, development, and test splits\. The dataset covers a diverse range of relations, with location\- and person\-based questions forming the largest categories\. As shown in the table, the relative proportions differ across splits\. This variation arises from enforcing a strict split\-by\-granularity\-realization rule, ensuring that the same realization does not appear in multiple splits and preventing data leakage between training and evaluation\.
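One way to enforce such a rule is to group samples that share any realization and assign whole groups to a split. The sketch below uses a small union-find for the grouping. It is our illustration of the idea, not the authors' splitting code, and the function names and toy samples are hypothetical.

```python
def group_samples(samples):
    """Group sample ids so that samples sharing a realization end up together."""
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    owner = {}  # realization -> first sample id that used it
    for i, realizations in enumerate(samples):
        for r in realizations:
            if r in owner:
                parent[find(i)] = find(owner[r])
            else:
                owner[r] = i

    groups = {}
    for i in range(len(samples)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def leaks(splits, samples):
    """Return True if any realization occurs in more than one split."""
    seen = {}
    for name, ids in splits.items():
        for i in ids:
            for r in samples[i]:
                if seen.setdefault(r, name) != name:
                    return True
    return False

# Toy hierarchies (hypothetical): each sample lists its realizations.
samples = [
    ["London", "England", "United Kingdom"],
    ["Oxford", "England", "United Kingdom"],
    ["Munich", "Bavaria", "Germany"],
    ["Berlin", "Germany"],
    ["Paris", "France"],
]
groups = group_samples(samples)
grouped_split = {"train": groups[0] + groups[1], "test": groups[2]}
naive_split = {"train": [0, 2, 4], "test": [1, 3]}
print(groups, leaks(grouped_split, samples), leaks(naive_split, samples))
```

Because whole groups move together, the per-relation counts in each split cannot be balanced freely, which is consistent with the uneven proportions in Table 7.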

### F\.1 Distribution of Granularity Resolution

| Gran\. Resolution | Count |
| --- | --- |
| 1 | 206 |
| 2 | 1966 |
| 3 | 5650 |
| 4 | 1320 |
| Mean | 2\.88 |
| Variance | 0\.44 |

Table 8: Distribution of granularity resolution in Granola\-EQ\. The granularity resolution indicates the number of distinct granularity levels available for an entity\.

[Table 8](https://arxiv.org/html/2605.26620#A6.T8) shows the distribution of granularity resolution across the full dataset\. Granularity resolution denotes the number of levels available for a given entity\. The mean and variance indicate that most entities exhibit a moderate number of granularity levels, with the majority containing three levels\.
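The summary statistics in Table 8 follow directly from the counts. The short check below recomputes them, assuming the reported variance is the population variance (dividing by the number of entries); that choice is our assumption.

```python
# Counts per granularity resolution, copied from Table 8.
counts = {1: 206, 2: 1966, 3: 5650, 4: 1320}

total = sum(counts.values())
mean = sum(level * n for level, n in counts.items()) / total
variance = sum(n * (level - mean) ** 2 for level, n in counts.items()) / total
share_three = counts[3] / total

print(total, round(mean, 2), round(variance, 2), round(share_three, 2))
# prints: 9142 2.88 0.44 0.62
```

The total of 9,142 entries matches the sum of the three splits in Table 7 (6,702 + 1,220 + 1,220), and roughly 62% of entries have exactly three levels.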

### F\.2 Context\-dependent granularity variation

| Gran\. Level | Count |
| --- | --- |
| 1 | 2 |
| 2 | 13 |
| 2\.5 | 114 |
| 3 | 160 |
| 4 | 198 |
| Mean | 3\.25 |
| Variance | 0\.44 |

Table 9: Distribution of normalized granularity levels for the realization *England* in Granola\-EQ\.

Granularity is context\-dependent: the same entity may occur at different levels of abstraction depending on the question\. As described in [Section 3\.1](https://arxiv.org/html/2605.26620#S3.SS1), Granola\-EQ preserves these contextual variations\. [Table 9](https://arxiv.org/html/2605.26620#A6.T9) illustrates this effect for the entity *England*, showing that identical entities appear at multiple normalized granularity levels across dataset entries\. Granuscore therefore reflects an aggregated, context\-averaged view of granularity suitable for global analysis\.

### F\.3 Granuscore Across Annotated Granularity Levels

| Level | 1 | 2 | 2\.5 | 3 | 4 |
| --- | --- | --- | --- | --- | --- |
| Percentile | 28\.54 | 47\.33 | 57\.27 | 64\.12 | 77\.29 |
| Raw | 1\.55 | 1\.99 | 2\.19 | 2\.41 | 2\.79 |

Table 10: Average Granuscore per normalized granularity level on the Granola\-EQ test set using the HiT Random Anchor method\. We report both the percentile score and the corresponding raw value\.

To examine whether Granuscore meaningfully distinguishes between granularity levels, [Table 10](https://arxiv.org/html/2605.26620#A6.T10) reports the average raw and percentile scores for each normalized level in the Granola\-EQ test set\.

Both the raw Granuscore values and their percentile equivalents increase consistently with the annotated granularity levels\. The distances between adjacent levels are relatively uniform, indicating that Granuscore reflects the intended ordering of abstraction levels\. This regular spacing suggests that Granuscore provides an interpretable scale of semantic granularity rather than producing small, arbitrary numerical differences\. Pairwise comparisons between all levels are statistically significant \(one\-sided Wilcoxon signed\-rank test; lower granularity $<$ higher granularity; $p \leq 6.4 \times 10^{-6}$\)\.
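The claim about uniform spacing can be checked against the raw values in Table 10 by dividing each score difference by the gap between the corresponding levels, since the annotated levels are not equally spaced (1, 2, 2.5, 3, 4). This is a small arithmetic check of ours, not an analysis from the paper.

```python
# Normalized levels and average raw Granuscore values from Table 10.
levels = [1, 2, 2.5, 3, 4]
raw = [1.55, 1.99, 2.19, 2.41, 2.79]

# Raw-score increase per unit of annotated level between adjacent levels.
slopes = [
    round((raw[i + 1] - raw[i]) / (levels[i + 1] - levels[i]), 2)
    for i in range(len(levels) - 1)
]
print(slopes)
# prints: [0.44, 0.4, 0.44, 0.38]
```

Each unit step in annotated level corresponds to an increase of roughly 0.4 in raw Granuscore.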

### F\.4 Additional Metrics

| Method | Kendall $\tau$ | Pearson $r$ | Intra Kendall $\tau$ |
| --- | --- | --- | --- |
| Word Count | 4\.15 | 1\.07 | 7\.91 |
| WordNet† | 25\.54 | 28\.15 | 61\.15 |
| GPT\-4\.1 mini | 69\.84 | 74\.04 | 88\.24 |
| HiT Dist0 | 50\.70 | 52\.37 | 75\.73 |
| MiniLM NN | 28\.65 | 36\.94 | 39\.00 |
| MiniLM Random | 28\.79 | 37\.87 | 41\.14 |
| MiniLM RandomAnch | 28\.73 | 37\.76 | 42\.31 |
| HiT NN | 52\.28 | 65\.13 | 72\.93 |
| HiT Random | 50\.53 | 64\.41 | 69\.49 |
| HiT RadialAnch | 53\.16 | 66\.04 | 76\.70 |
| HiT RandomAnch | 54\.28 | 67\.58 | 76\.30 |
| HiT Dist0 \+ NN | 54\.04 | 67\.55 | 74\.05 |
| HiT Dist0 \+ Random | 52\.51 | 66\.67 | 73\.65 |
| HiT Dist0 \+ RadialAnch | 54\.66 | 68\.91 | 77\.66 |
| HiT Dist0 \+ RandomAnch | 55\.53 | 69\.69 | 78\.06 |

Table 11: Kendall’s $\tau$, Pearson $r$, and Intra\-sample Kendall’s $\tau$ on Granola\-EQ\. Kendall’s $\tau$ measures global ordering across all answers, while Intra\-sample Kendall’s $\tau$ measures ordering within individual questions\. †: WordNet could only derive a hierarchy for 17\.61% of the answers in the test set\.

[Table 11](https://arxiv.org/html/2605.26620#A6.T11) reports additional evaluation metrics on Granola\-EQ, including Kendall’s $\tau$, Pearson’s $r$, and Intra\-sample Kendall’s $\tau$\. Kendall’s $\tau$ and Pearson’s $r$ are computed across realizations from all samples, measuring how well predicted granularity scores follow the global ordering of annotated levels across unrelated entities\. In contrast, Intra\-sample Kendall’s $\tau$ is computed within each question and then averaged, reflecting how well a method preserves the ordering of realizations within individual hierarchies\.

Under these metrics, the LLM baseline \(GPT\-4\.1 mini\) achieves the highest scores, followed by HiT Random Anchor\. However, this result should be interpreted with caution\. The LLM frequently assigns identical granularity levels to multiple answers, resulting in many tied comparisons\. Since Kendall’s $\tau$ excludes tied pairs from the computation, these ties effectively remove more difficult comparisons and leave only easier ordering decisions, artificially inflating the score\.

For this reason, Kendall\-based metrics can overestimate the performance of models that produce many tied predictions\. To provide a more faithful evaluation of hierarchy recovery, we therefore report Pairwise Accuracy and Exact Ordering Accuracy in[Section 4\.1](https://arxiv.org/html/2605.26620#S4.SS1), which evaluate the ordering of all answer pairs and penalize tied predictions\.
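The effect of ties can be made concrete with a toy hierarchy. The sketch below computes a Kendall-style $\tau$ that drops tied pairs, as described above, alongside a pairwise accuracy that counts a tied prediction as an error. The exact definition of Pairwise Accuracy in Section 4.1 is not reproduced in this appendix, so the version here is an assumption, as are the function names and toy scores.

```python
from itertools import combinations

def sign(x):
    return (x > 0) - (x < 0)

def tau_excluding_ties(gold, pred):
    """(concordant - discordant) / (concordant + discordant), skipping tied pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(gold)), 2):
        g, p = sign(gold[i] - gold[j]), sign(pred[i] - pred[j])
        if g == 0 or p == 0:
            continue  # tied pair: removed from the computation
        if g == p:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def pairwise_accuracy(gold, pred):
    """Share of gold-ordered pairs ranked correctly; a predicted tie counts as wrong."""
    pairs = [(i, j) for i, j in combinations(range(len(gold)), 2) if gold[i] != gold[j]]
    correct = sum(sign(gold[i] - gold[j]) == sign(pred[i] - pred[j]) for i, j in pairs)
    return correct / len(pairs)

gold = [1, 2, 3, 4]
tied_pred = [1, 1, 1, 4]    # coarse predictor: three answers share one level
swapped_pred = [1, 3, 2, 4]  # fine-grained predictor with one swapped pair

for name, pred in [("tied", tied_pred), ("swapped", swapped_pred)]:
    print(name, round(tau_excluding_ties(gold, pred), 3), round(pairwise_accuracy(gold, pred), 3))
```

In this toy case the tied predictor reaches a $\tau$ of 1.0 while ordering only half of the pairs, whereas the predictor with a single swap gets a lower $\tau$ (about 0.667) but a higher pairwise accuracy (about 0.833).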

## Appendix G Ablation Study on the Number of References

| Reference Size | PW Acc\. |
| --- | --- |
| 33 | 83\.37 |
| 66 | 83\.16 |
| 99 | 83\.13 |
| 333 | 83\.30 |
| 666 | 82\.44 |
| 999 | **83\.76** |
| 1332 | 82\.47 |
| 1665 | *83\.45* |

Table 12: Ablation study on the number of reference anchors on Granola\-EQ using Random Anchors\. Bold indicates the best result, and italics the second\-best result across anchor sizes\.

We analyze the effect of the number of reference anchors when using the Random Anchor method\. [Table 12](https://arxiv.org/html/2605.26620#A7.T12) reports pairwise accuracy on the Granola\-EQ test split while varying the anchor set size\.

Performance remains largely stable across different anchor sizes, indicating that the method is not highly sensitive to this parameter\. The best performance is obtained with 999 anchors, which we therefore use as the default configuration in our experiments\.

## Appendix H Ablation Study on Aggregation Strategy

Because large\-scale annotations of granularity for longer texts are difficult to obtain, we use our scientific papers testbed to determine an effective aggregation strategy\. We compare several aggregation operators, including mean, weighted mean, sum, min, max, and lower quantile mean \(lqm\)\. The lower quantile mean with threshold $q$ averages only the lowest $q$ proportion of unit\-level scores within a section \(e\.g\., $\text{lqm}(0.3)$ averages the lowest 30% of scores\)\.

[Table 14](https://arxiv.org/html/2605.26620#A12.T14) reports ordering accuracy for the different aggregation strategies\. The best performance is achieved by a two\-step aggregation procedure\. First, we compute a sentence\-level Granuscore by averaging the scores of the extracted referential units within each sentence\. We then aggregate across sentences by taking the mean of the lowest 80% of sentence\-level Granuscores, which reduces the influence of unusually high values\.
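The two-step procedure can be sketched in a few lines. This is a minimal illustration, assuming that the lowest $q$ proportion of $n$ scores means the $\lceil q \cdot n \rceil$ smallest values (the rounding rule is not stated here) and that sentences without referential units are skipped; the function names and toy scores are hypothetical.

```python
import math

def lqm(scores, q):
    """Lower quantile mean: average of the lowest q proportion of scores."""
    # Rounding rule is an assumption; the small epsilon guards against float noise.
    k = max(1, math.ceil(q * len(scores) - 1e-9))
    return sum(sorted(scores)[:k]) / k

def aggregate_section(sentences, q=0.8):
    """Step 1: mean over units within each sentence. Step 2: lqm across sentences."""
    sentence_scores = [sum(units) / len(units) for units in sentences if units]
    return lqm(sentence_scores, q)

# Toy section: each inner list holds unit-level Granuscores of one sentence.
section = [[2.0, 4.0], [1.0], [3.0, 3.0], [5.0], [10.0]]
sentence_means = [sum(u) / len(u) for u in section]

print(aggregate_section(section))                  # lqm(0.8) across sentences
print(sum(sentence_means) / len(sentence_means))   # plain mean, for comparison
```

In the toy section the sentence with score 10.0 is dropped by the 80% cut, so the aggregate is 3.0 rather than the plain mean of 4.4.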

Some aggregation variants perform substantially worse than others for methodological reasons\. In particular, max\-based strategies \(e\.g\., sent\-max\-pool\-max, doc\-pool\-max\) reduce an entire document to a single referential unit, effectively ignoring most of the content\. Since many sentences contain at least some coarse or vague elements, these methods systematically bias scores toward coarse\-grained representations and therefore provide poor discrimination\.

For sum\-based strategies \(e\.g\., doc\-pool\-sum and sent\-sum\-\*\), the issue is different: scores accumulate additively across sentences, causing document\-level granularity estimates to scale with text length\. This behavior conflicts with our notion of granularity, which is determined by the hierarchical level of referential expressions rather than the amount of information conveyed by a text\.

For completeness, we additionally evaluate all aggregation strategies in generalized additive models predicting sentence specificity from length and granularity\. [Table 15](https://arxiv.org/html/2605.26620#A12.T15) reports the corresponding improvements in explained deviance\. The rankings induced by ordering accuracy and explained deviance are moderately correlated \(Pearson $r = 0.62$\), indicating that aggregation strategies that better recover hierarchical ordering also tend to better explain sentence specificity\.

## Appendix I Correlation to Sentence Specificity

![Refer to caption](https://arxiv.org/html/2605.26620v1/x9.png)

Figure 9: Effect of length on sentence specificity across domains\. Longer sentences correspond to more specific sentences\. The plotted range is restricted to the 1st–99th percentiles of Granuscore to avoid sparse\-support regions\.

The sentence specificity datasets include 920 movie review sentences, 984 Twitter posts, and 845 Yelp reviews from Ko et al\. \([2019](https://arxiv.org/html/2605.26620#bib.bib16)\), as well as 573 news sentences from Li et al\. \([2016](https://arxiv.org/html/2605.26620#bib.bib11)\)\.

For completeness, [Figure 4](https://arxiv.org/html/2605.26620#S4.F4) shows the estimated effect of sentence length \(measured as word count\) on sentence specificity across domains, restricted to the data\-supported Granuscore range\. As expected, we observe a negative relationship: as sentence length increases, specificity scores decrease, indicating more specific sentences\.

The strength of this effect varies by domain\. For Twitter, specificity decreases most rapidly with increasing length, followed by movie reviews, Yelp reviews, and news articles\. This pattern reflects domain\-specific length distributions: Twitter texts are typically much shorter than those in other domains, while news articles tend to be longer and more descriptive\.

## Appendix J Scientific Papers as Discourse Contexts

We compare paragraphs from the Introduction and Related Work sections\. These sections are selected because their communicative roles are well defined: the Introduction typically presents the research problem and context, whereas the Related Work section situates the contribution within existing literature\.

We sample the first 1,000 papers from the S2ORC corpus that contain standard Introduction, Related Work, and Conclusion sections, ensuring a consistent and well\-structured discourse layout\. Before analysis, we remove bracketed text, URLs, figure captions, and common PDF/OCR artifacts\.

To obtain comparable text segments across papers, we apply a simple paragraph selection procedure\. For the Introduction, we select the first paragraph containing at least ten referential units\. For the Related Work section, we skip the opening paragraph, as it often functions as a brief transition, and instead select the first subsequent paragraph meeting this criterion\. If no such paragraph exists, we fall back to the opening paragraph if it also satisfies the requirement\. This procedure yields 978 papers for comparison\.
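The selection rule can be written down compactly. In the sketch below each paragraph is represented only by its number of referential units; how those units are extracted is outside the scope of this illustration, and the function name and example counts are hypothetical.

```python
def select_paragraph(unit_counts, skip_opening=False, min_units=10):
    """Return the index of the selected paragraph, or None if none qualifies.

    unit_counts[i] is the number of referential units in paragraph i.
    skip_opening=True mirrors the Related Work rule: prefer later paragraphs
    and fall back to the opening paragraph only if nothing else qualifies.
    """
    start = 1 if skip_opening else 0
    for i in range(start, len(unit_counts)):
        if unit_counts[i] >= min_units:
            return i
    if skip_opening and unit_counts and unit_counts[0] >= min_units:
        return 0
    return None

print(select_paragraph([4, 12, 15]))                      # Introduction rule
print(select_paragraph([11, 3, 14], skip_opening=True))   # Related Work rule
print(select_paragraph([11, 3, 5], skip_opening=True))    # fallback to opening
print(select_paragraph([2, 3]))                           # paper is dropped
```

A paper enters the comparison only if both sections yield a paragraph, which would explain why fewer than the 1,000 sampled papers remain.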

## Appendix K Additional QA Analyses and Potential Confounding Factors

#### Question Granularity

Beyond the correlation between the Granuscore of gold answers and model correctness reported in [Section 5](https://arxiv.org/html/2605.26620#S5), we also analyze the Granuscore of the corresponding questions\. [Figure 10](https://arxiv.org/html/2605.26620#A11.F10) shows a similar trend: datasets with lower question Granuscore are associated with lower correctness\. This pattern is consistent across models\. As for gold answers \([Figure 5](https://arxiv.org/html/2605.26620#S5.F5)\), the smallest model \(Qwen3 0\.6B\) exhibits a weaker slope and partial saturation, whereas DeepSeek V3\.2 follows the same overall trend but at higher accuracy levels\.

![Refer to caption](https://arxiv.org/html/2605.26620v1/x10.png)

Figure 10: Relationship between dataset\-level question Granuscore and model correctness across QA benchmarks\. Higher Granuscore datasets are associated with higher correctness across models\. All pairwise differences in Granuscore between datasets are statistically significant \(Mann–Whitney $U$, $p \leq 3.2 \times 10^{-30}$\)\.

#### Potential Confounding Factors

| Dataset | Accuracy | Word Frequency \(Answer\) | Word Frequency \(Question\) | Tree Depth \(Answer\) | Tree Depth \(Question\) | Length \(Answer\) | Length \(Question\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FACTS Parametric | 0\.050 | 0\.018 | 0\.153 | 2\.265 | 3\.518 | 3\.108 | 6\.198 |
| SimpleQA | 0\.078 | 0\.022 | 0\.510 | 1\.859 | 6\.315 | 2\.240 | 16\.310 |
| SQuAD | 0\.302 | 0\.062 | 0\.662 | 2\.249 | 4\.939 | 2\.957 | 10\.198 |
| TruthfulQA | 0\.435 | 1\.051 | 0\.808 | 4\.244 | 4\.775 | 9\.118 | 10\.620 |

Table 13: Dataset\-level statistics for alternative properties potentially related to QA difficulty\. The Accuracy column reports mean correctness across evaluated models\.

We additionally analyze several alternative properties potentially related to QA difficulty: answer and question length, word frequency, and syntactic complexity\. For word frequency, we compute the average token frequency using wordfreq \(Speer et al\., [2022](https://arxiv.org/html/2605.26620#bib.bib62)\)\. For syntactic complexity, we measure average dependency tree depth using spaCy parses \(Honnibal et al\., [2020](https://arxiv.org/html/2605.26620#bib.bib29)\)\. [Table 13](https://arxiv.org/html/2605.26620#A11.T13) reports the corresponding dataset\-level statistics\.

We observe a mild relationship between correctness and word frequency, with lower\-performing datasets generally containing rarer terms\. However, this effect is also partially related to granularity itself, since fine\-grained concepts tend to be less frequent\. In contrast, syntactic complexity and length\-based measures do not exhibit comparably consistent relationships with correctness\. Together, these findings suggest that the observed Granuscore trends are not explained solely by superficial textual properties\.

## Appendix L QA Generation and Evaluation

For answer generation, we use a maximum length of 512 tokens for standard generation and 2,048 tokens for reasoning\-based generation, with the temperature set to 0\. We retain only responses that terminate before the token limit, ensuring all evaluated outputs are complete and not truncated\.

Models are instructed to produce answers of at most five sentences using the following prompt:

User Prompt: Answer Generation

Answer the following query in at most five complete sentences: <query\>

Model responses are evaluated using GPT\-4\.1 nano as an LLM\-based judge, following the prompt template introduced in SimpleQA \(Wei et al\., [2024](https://arxiv.org/html/2605.26620#bib.bib22)\)\.

| Aggregation | Ord\. Acc\. | Aggregation | Ord\. Acc\. | Aggregation | Ord\. Acc\. |
| --- | --- | --- | --- | --- | --- |
| sent\-weighted\-mean\-pool\-sum | 58\.49 | sent\-lqm\-0\.9\-pool\-sum | 60\.84 | sent\-min\-pool\-sum | 60\.02 |
| sent\-weighted\-mean\-pool\-mean | 66\.46 | sent\-lqm\-0\.8\-pool\-sum | 62\.27 | sent\-min\-pool\-mean | 66\.16 |
| sent\-weighted\-mean\-pool\-lqm\-0\.1 | 62\.68 | sent\-lqm\-0\.7\-pool\-sum | 62\.68 | sent\-min\-pool\-lqm\-0\.1 | 61\.76 |
| sent\-weighted\-mean\-pool\-lqm\-0\.3 | 66\.87 | sent\-lqm\-0\.9\-pool\-mean | *68\.10* | sent\-min\-pool\-lqm\-0\.3 | 64\.01 |
| sent\-weighted\-mean\-pool\-lqm\-0\.5 | 67\.28 | sent\-lqm\-0\.8\-pool\-mean | **68\.71** | sent\-min\-pool\-lqm\-0\.5 | 65\.03 |
| sent\-weighted\-mean\-pool\-min | 61\.55 | sent\-lqm\-0\.7\-pool\-mean | 67\.59 | sent\-min\-pool\-min | 59\.82 |
| sent\-weighted\-mean\-pool\-max | 58\.79 | sent\-lqm\-0\.9\-pool\-lqm\-0\.1 | 63\.80 | sent\-min\-pool\-max | 60\.53 |
| sent\-sum\-pool\-sum | 47\.55 | sent\-lqm\-0\.9\-pool\-lqm\-0\.3 | 66\.16 | sent\-max\-pool\-sum | 52\.66 |
| sent\-sum\-pool\-mean | 45\.19 | sent\-lqm\-0\.9\-pool\-lqm\-0\.5 | 66\.87 | sent\-max\-pool\-mean | 54\.19 |
| sent\-sum\-pool\-lqm\-0\.1 | 48\.67 | sent\-lqm\-0\.8\-pool\-lqm\-0\.1 | 63\.80 | sent\-max\-pool\-lqm\-0\.1 | 53\.58 |
| sent\-sum\-pool\-lqm\-0\.3 | 47\.03 | sent\-lqm\-0\.8\-pool\-lqm\-0\.3 | 67\.59 | sent\-max\-pool\-lqm\-0\.3 | 54\.91 |
| sent\-sum\-pool\-lqm\-0\.5 | 45\.71 | sent\-lqm\-0\.8\-pool\-lqm\-0\.5 | 67\.89 | sent\-max\-pool\-lqm\-0\.5 | 56\.85 |
| sent\-sum\-pool\-min | 48\.67 | sent\-lqm\-0\.7\-pool\-lqm\-0\.1 | 63\.29 | sent\-max\-pool\-min | 52\.56 |
| sent\-sum\-pool\-max | 44\.38 | sent\-lqm\-0\.7\-pool\-lqm\-0\.3 | 67\.38 | sent\-max\-pool\-max | 48\.47 |
| sent\-mean\-pool\-sum | 60\.53 | sent\-lqm\-0\.7\-pool\-lqm\-0\.5 | 67\.59 | doc\-pool\-sum | 47\.55 |
| sent\-mean\-pool\-mean | 67\.48 | sent\-lqm\-0\.9\-pool\-min | 62\.27 | doc\-pool\-mean | 66\.36 |
| sent\-mean\-pool\-lqm\-0\.1 | 62\.88 | sent\-lqm\-0\.8\-pool\-min | 62\.07 | doc\-pool\-lqm\-0\.1 | 64\.83 |
| sent\-mean\-pool\-lqm\-0\.3 | 66\.05 | sent\-lqm\-0\.7\-pool\-min | 62\.07 | doc\-pool\-lqm\-0\.3 | 65\.64 |
| sent\-mean\-pool\-lqm\-0\.5 | 66\.16 | sent\-lqm\-0\.9\-pool\-max | 59\.41 | doc\-pool\-lqm\-0\.5 | 66\.05 |
| sent\-mean\-pool\-min | 62\.07 | sent\-lqm\-0\.8\-pool\-max | 60\.33 | doc\-pool\-min | 59\.92 |
| sent\-mean\-pool\-max | 59\.10 | sent\-lqm\-0\.7\-pool\-max | 59\.82 | doc\-pool\-max | 47\.75 |

Table 14: Accuracy of section ordering \(Introduction $>$ Related Work\) under different aggregation strategies\. Aggregation names follow the pattern scope\-aggregation\-pool, where scope indicates whether aggregation is performed at the document level \(doc\) or sentence level \(sent\)\. For sentence\-level strategies, the first operator aggregates scores across sentences \(e\.g\., sent\-mean\)\. The pool operator specifies how Granuscores of referential units within a sentence are combined \(e\.g\., sent\-mean\-pool\-sum first sums Granuscores within each sentence and then averages across sentences\)\. Bold and italics denote the best and second\-best results, respectively\.

| Method | $\Delta$ Expl\. Dev\. | Method | $\Delta$ Expl\. Dev\. | Method | $\Delta$ Expl\. Dev\. |
| --- | --- | --- | --- | --- | --- |
| sent\-sum\-pool\-sum | 3\.67 | sent\-lqm\-0\.9\-pool\-lqm\-0\.3 | 9\.20 | sent\-max\-pool\-sum | 2\.74 |
| sent\-sum\-pool\-mean | 6\.75 | sent\-lqm\-0\.9\-pool\-lqm\-0\.5 | 9\.20 | sent\-max\-pool\-mean | 7\.09 |
| sent\-sum\-pool\-lqm\-0\.1 | 6\.64 | sent\-lqm\-0\.8\-pool\-lqm\-0\.1 | 9\.20 | sent\-max\-pool\-lqm\-0\.1 | 7\.81 |
| sent\-sum\-pool\-lqm\-0\.3 | 7\.27 | sent\-lqm\-0\.8\-pool\-lqm\-0\.3 | 9\.24 | sent\-max\-pool\-lqm\-0\.3 | 7\.68 |
| sent\-sum\-pool\-lqm\-0\.5 | 7\.12 | sent\-lqm\-0\.8\-pool\-lqm\-0\.5 | 9\.24 | sent\-max\-pool\-lqm\-0\.5 | 7\.71 |
| sent\-sum\-pool\-min | 6\.10 | sent\-lqm\-0\.7\-pool\-lqm\-0\.1 | 9\.23 | sent\-max\-pool\-min | 7\.28 |
| sent\-sum\-pool\-max | 1\.51 | sent\-lqm\-0\.7\-pool\-lqm\-0\.3 | 9\.35 | sent\-max\-pool\-max | 2\.00 |
| sent\-mean\-pool\-sum | 2\.55 | sent\-lqm\-0\.7\-pool\-lqm\-0\.5 | 9\.29 | doc\-pool\-sum | 3\.49 |
| sent\-mean\-pool\-mean | 8\.34 | sent\-lqm\-0\.9\-pool\-min | 8\.76 | doc\-pool\-mean | 8\.56 |
| sent\-mean\-pool\-lqm\-0\.1 | 9\.19 | sent\-lqm\-0\.8\-pool\-min | 8\.77 | doc\-pool\-lqm\-0\.1 | *9\.84* |
| sent\-mean\-pool\-lqm\-0\.3 | 9\.20 | sent\-lqm\-0\.7\-pool\-min | 8\.78 | doc\-pool\-lqm\-0\.3 | 9\.52 |
| sent\-mean\-pool\-lqm\-0\.5 | 9\.20 | sent\-lqm\-0\.9\-pool\-max | 1\.87 | doc\-pool\-lqm\-0\.5 | 9\.48 |
| sent\-mean\-pool\-min | 8\.76 | sent\-lqm\-0\.8\-pool\-max | 1\.86 | doc\-pool\-min | 9\.51 |
| sent\-mean\-pool\-max | 1\.87 | sent\-lqm\-0\.7\-pool\-max | 1\.85 | doc\-pool\-max | 1\.88 |
| sent\-lqm\-0\.9\-pool\-sum | 2\.55 | sent\-min\-pool\-sum | 1\.95 | sent\-weighted\-mean\-pool\-sum | 2\.65 |
| sent\-lqm\-0\.8\-pool\-sum | 2\.56 | sent\-min\-pool\-mean | 8\.07 | sent\-weighted\-mean\-pool\-mean | 8\.62 |
| sent\-lqm\-0\.7\-pool\-sum | 2\.50 | sent\-min\-pool\-lqm\-0\.1 | **9\.98** | sent\-weighted\-mean\-pool\-lqm\-0\.1 | 9\.14 |
| sent\-lqm\-0\.9\-pool\-mean | 8\.34 | sent\-min\-pool\-lqm\-0\.3 | 9\.42 | sent\-weighted\-mean\-pool\-lqm\-0\.3 | 9\.41 |
| sent\-lqm\-0\.8\-pool\-mean | 8\.33 | sent\-min\-pool\-lqm\-0\.5 | 9\.18 | sent\-weighted\-mean\-pool\-lqm\-0\.5 | 9\.54 |
| sent\-lqm\-0\.7\-pool\-mean | 8\.33 | sent\-min\-pool\-min | 9\.53 | sent\-weighted\-mean\-pool\-min | 8\.69 |
| sent\-lqm\-0\.9\-pool\-lqm\-0\.1 | 9\.19 | sent\-min\-pool\-max | 1\.84 | sent\-weighted\-mean\-pool\-max | 1\.90 |

Table 15: Ablation over aggregation strategies measured by the improvement in explained deviance \($\Delta$ Expl\. Dev\., $\times 100$\) relative to a length\-only baseline for sentence specificity\. Aggregation names follow the pattern scope\-aggregation\-pool, where scope indicates whether aggregation is performed at the document level \(doc\) or sentence level \(sent\)\. For sentence\-level strategies, the first operator aggregates scores across sentences, while pool specifies how Granuscores of referential units within each sentence are combined\. For example, sent\-mean\-pool\-sum first sums Granuscores within each sentence and then averages across sentences\. Bold and italics denote the best and second\-best results, respectively\.
