Narrative-UFET: Narrative Generation for Ultra-Fine Entity Typing

arXiv cs.CL 06/29/26, 04:00 AM Papers
Summary
This paper introduces Narrative-UFET, a method that generates short narratives to provide broader context for ultra-fine entity typing, improving performance on long-tail types compared to sentence-level baselines.
arXiv:2606.27598v1 Announce Type: new
# Narrative-UFET: Narrative Generation for Ultra-Fine Entity Typing
Source: [https://arxiv.org/html/2606.27598](https://arxiv.org/html/2606.27598)
Mreedul Gupta, Advait Deshmukh, Ashwin Umadi, Matt Pauk, Maria Leonor Pacheco
University of Colorado Boulder
{mreedul.gupta, advait.deshmukh, ashwin.umadi, matt.pauk, maria.pacheco}@colorado.edu

###### Abstract

Ultra-fine entity typing (UFET) assigns highly specific types to entity mentions, but current approaches struggle with types in the long tail. We hypothesize that a key limitation is the reliance on sentence-level context, since disambiguating evidence is often spread across multiple sentences. Testing this has been difficult because all existing UFET resources are sentence-level. We present Narrative-UFET, a controlled extension of UFET in which each entity mention is paired with an automatically generated short, coherent narrative. Synthesizing narratives lets us isolate the effect of specific discourse properties. We experiment with two paired variants: one in which the entity’s type is held constant across the narrative (Maintain) and one in which it shifts (Change). We show that narrative context yields consistent improvements on long-tail types over sentence-level baselines, with the Change variant providing the stronger signal. A comparison against naturally occurring contexts shows that synthetic narratives yield stronger gains, indicating that controlled discourse construction can surface signals that real text leaves implicit. Substantial room for improvement remains, suggesting open directions in both discourse modeling and narrative construction.


## 1 Introduction

Ultra-fine entity typing (UFET) is the task of assigning highly specific types to entity mentions based on the context in which they appear (Choi et al., [2018](https://arxiv.org/html/2606.27598#bib.bib2)). Unlike coarse-grained typing, which selects from a small set of broad categories such as *person* or *organization*, UFET aims to capture contextually relevant distinctions. In the sentence “In his next note on Baidu, he wrote that the company is trading above its fair value,” the pronoun “he” could be typed coarsely as a *person*, but the surrounding context supports more specific types, such as *writer*, *editor*, or *analyst*. Identifying these fine-grained types is valuable for a wide range of downstream tasks, including coreference resolution (Onoe and Durrett, [2020](https://arxiv.org/html/2606.27598#bib.bib19)), entity linking (Ling et al., [2015](https://arxiv.org/html/2606.27598#bib.bib16)), relation extraction (Koch et al., [2014](https://arxiv.org/html/2606.27598#bib.bib12)), knowledge graph completion (Li et al., [2024b](https://arxiv.org/html/2606.27598#bib.bib14)), and multi-modal entity recognition and grounding (Wang et al., [2024](https://arxiv.org/html/2606.27598#bib.bib25); Li et al., [2024a](https://arxiv.org/html/2606.27598#bib.bib13)).

A central challenge for UFET is the long tail of the entity distribution. Pretrained language models (PLMs), which most current approaches build on (Dai et al., [2021](https://arxiv.org/html/2606.27598#bib.bib3); Li et al., [2023](https://arxiv.org/html/2606.27598#bib.bib15); Deshmukh et al., [2025](https://arxiv.org/html/2606.27598#bib.bib4)), perform well on entities that appear frequently in their pretraining corpora but degrade sharply on rare ones (Deshmukh et al., [2025](https://arxiv.org/html/2606.27598#bib.bib4)). This is precisely the setting where fine-grained disambiguation is most needed. We hypothesize that part of this gap stems from a reliance on sentence-level context. In realistic settings such as articles or books, the evidence needed to disambiguate fine-grained types is rarely contained in a single sentence. Instead, it must be pieced together from cues distributed across the surrounding narrative. Testing this hypothesis has been difficult, however, because all existing annotated resources for fine and ultra-fine entity typing are sentence-level (Riedel et al., [2010](https://arxiv.org/html/2606.27598#bib.bib22); Gillick et al., [2014b](https://arxiv.org/html/2606.27598#bib.bib8); Choi et al., [2018](https://arxiv.org/html/2606.27598#bib.bib2); Ling and Weld, [2021](https://arxiv.org/html/2606.27598#bib.bib17); Ding et al., [2021](https://arxiv.org/html/2606.27598#bib.bib5)).

To address this gap, we construct Narrative-UFET, an extended version of the UFET dataset (Choi et al., [2018](https://arxiv.org/html/2606.27598#bib.bib2)) in which each entity-sentence pair is paired with a short, automatically generated narrative built around the target entity. We choose to synthesize narratives rather than retrieve them from real corpora because synthesis lets us control specific properties of the discourse, holding everything else fixed. This control is the core methodological move of the paper, as it lets us isolate the effect of a single discourse property on typing performance. As a case study, we construct two variants of Narrative-UFET that differ in one such property: whether the entity’s type is held constant across the narrative (Maintain) or shifts across it (Change). We use this contrast to test whether type variation across discourse is a useful signal for ultra-fine typing.

We validate the quality of the generated narratives through automated metrics and human evaluation. Evaluating both masked and causal PLMs on the UFET task, we find that narrative context yields consistent improvements on long-tail types over sentence-level baselines, with the Change variant providing the stronger signal. A comparison against naturally occurring contexts shows that our synthetic narratives yield stronger gains than real text alone, indicating that controlled discourse construction can surface signal that real text leaves implicit. At the same time, substantial room for improvement remains, suggesting that progress on long-tail entity typing will require advances in both discourse-aware modeling and narrative construction beyond the single dimension studied here.

Our contributions are: (i) Narrative-UFET, a controlled narrative-level extension of UFET with two variants (Maintain and Change); (ii) evidence that narrative context improves typing on long-tail entities, with Change providing the strongest signal; and (iii) a comparison against naturally occurring contexts showing that synthetic narratives surface signal that real text leaves implicit.

## 2 Related Work

Narrative Generation and Evaluation. With the rise of powerful generative models, large language models (LLMs) are increasingly used to generate high-quality short and long narratives that exhibit more nuanced plot development and stylistic variation (Goldfarb-Tarrant et al., [2020](https://arxiv.org/html/2606.27598#bib.bib9); Yang et al., [2022](https://arxiv.org/html/2606.27598#bib.bib27); Harel-Canada et al., [2024](https://arxiv.org/html/2606.27598#bib.bib11)). Additionally, recent work has explored the use of LLMs in assessing the quality of narratives. This is a complex task, as quality is multi-faceted and at times subjective. Prior work has evaluated grammar as language fluency (Naismith et al., [2023](https://arxiv.org/html/2606.27598#bib.bib18)), creativity and originality as indicators of narrative novelty, and consistency in plot and character progression to assess story coherence over time (Chhun et al., [2022](https://arxiv.org/html/2606.27598#bib.bib1); Tian et al., [2024](https://arxiv.org/html/2606.27598#bib.bib24)). Recent research also emphasizes the importance of prompt engineering in guiding LLMs to produce more coherent and contextually appropriate narratives (Tang et al., [2024](https://arxiv.org/html/2606.27598#bib.bib23)). Although prior work has advanced narrative generation and evaluation, it remains unclear how well models can expand an entity-context pair into a complete narrative that is coherent and remains reliable when evaluated on the entity-typing task. To construct Narrative-UFET, we conduct model and prompt testing to generate narratives. We then verify the validity of Narrative-UFET through automated and human evaluation.

UFET and Pre-trained Language Models. UFET was introduced by Choi et al. ([2018](https://arxiv.org/html/2606.27598#bib.bib2)), who proposed the task of predicting free-form type labels for a target entity mention given the sentence in which it appears. Subsequent work leveraged PLMs to improve entity typing performance. Many of these approaches frame UFET as a masked language modeling problem. Dai et al. ([2021](https://arxiv.org/html/2606.27598#bib.bib3)) use Hearst patterns with [MASK] tokens and BERT to predict entity types. Similarly, Pan et al. ([2022](https://arxiv.org/html/2606.27598#bib.bib20)) generate ultra-fine type predictions by appending an entity mention and a [MASK] token to the input sentence. Deshmukh et al. ([2025](https://arxiv.org/html/2606.27598#bib.bib4)) expand on this work by exploring how such approaches perform on infrequent or rare entities, showing that PLMs struggle with long-tail entity typing unless additional knowledge about rare entities is incorporated, and even then the explored knowledge injection strategies proved insufficient. Motivated by these findings, we use Narrative-UFET to investigate whether richer discourse context can help PLMs better predict long-tail types, evaluating both masked and causal PLMs.
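
To make the two masked-prompt formulations mentioned here concrete, the following minimal sketch builds the input strings only; it does not call a model. The template wording ("is a", "and any other") and both function names are illustrative assumptions, not the exact templates used by Dai et al. (2021) or Pan et al. (2022).

```python
MASK = "[MASK]"  # BERT-style mask token


def append_mask_prompt(sentence: str, mention: str, mask_token: str = MASK) -> str:
    """Append the mention and a mask slot after the sentence.

    The "is a" connective is an assumed, illustrative template.
    """
    return f"{sentence} {mention} is a {mask_token}."


def hearst_prompt(sentence: str, start: int, end: int, mask_token: str = MASK) -> str:
    """Insert a Hearst-style pattern right after the mention span [start, end).

    "M and any other [MASK]" is one pattern in the Hearst family; which
    patterns a given system uses is not specified here.
    """
    mention = sentence[start:end]
    return f"{sentence[:start]}{mention} and any other {mask_token}{sentence[end:]}"


sentence = "He wrote that the company is trading above its fair value."
print(append_mask_prompt(sentence, "He"))
print(hearst_prompt(sentence, 0, 2))
```

A masked language model would then rank vocabulary items for the mask position, and the top-ranked words would serve as candidate types.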

## 3 Narrative-UFET

In this section, we describe the construction and evaluation of Narrative-UFET. We build on the crowd-annotated portion of the UFET dataset (Choi et al., [2018](https://arxiv.org/html/2606.27598#bib.bib2)), which contains 5,994 entity mentions paired with their surrounding sentence context and a set of human-annotated ultra-fine types (hereafter, gold types), evenly divided into training, development, and test splits.

### 3.1 Narrative Generation Pipeline

For each entity-sentence pair in the UFET dataset, we generated a short, self-contained narrative that embeds the original sentence verbatim while building a coherent narrative chain around the target entity. An example narrative is shown in Appendix [A.1.2](https://arxiv.org/html/2606.27598#A1.SS1.SSS2). The generation pipeline involved three stages: model selection, prompt design, and the generation of final datasets.

Model Selection. We generated narratives for the first 100 instances from the development set by instructing seven different models (GPT-OSS-20B, Llama3.3-70B, Gemma3-27B, Qwen3-8B, Qwen3-14B, Qwen3-32B, and Mistral-7B) to produce short, coherent stories around each target entity, requiring that the original UFET sentence appear verbatim (prompt and example narrative in Appendices [A.1.1](https://arxiv.org/html/2606.27598#A1.SS1.SSS1) and [A.1.2](https://arxiv.org/html/2606.27598#A1.SS1.SSS2)).

We test quality across three dimensions: (1) Narrative Quality, using the TinyStories framework (Eldan and Li, [2023](https://arxiv.org/html/2606.27598#bib.bib6)), which scores grammar, creativity, consistency, and plot on a 1–10 scale (prompt and other details are in Appendix [A.2](https://arxiv.org/html/2606.27598#A1.SS2)); (2) Discourse Coherence, measured as both context-to-story alignment (how semantically related each sentence in the narrative is to the original sentence) and story-internal coherence (how related each sentence is to its prior sentence), with implementation details in Appendix [A.3](https://arxiv.org/html/2606.27598#A1.SS3); and (3) Coreference Density, which comprises two measures. Coreference chain length is the total number of target-entity mentions across the narrative; longer chains indicate richer entity-centered context, since more information about the entity is available. Coreference density normalizes the chain length by the narrative's sentence count, which keeps the measure comparable when model generations differ in their number of sentences. All metrics are given equal importance.
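
The coherence and coreference measures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from Appendix A.3: a bag-of-words cosine stands in for whatever sentence-similarity model the authors used, per-sentence scores are aggregated by a simple mean, and the per-sentence mention counts are assumed to come from an external coreference resolver. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector; a placeholder for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def context_to_story(narrative, original):
    """Mean similarity of each narrative sentence to the original sentence."""
    ref = bow(original)
    return sum(cosine(bow(s), ref) for s in narrative) / len(narrative)


def story_internal(narrative):
    """Mean similarity of each sentence to the sentence before it."""
    pairs = list(zip(narrative, narrative[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(bow(a), bow(b)) for a, b in pairs) / len(pairs)


def coreference_stats(mentions_per_sentence):
    """Return (chain length, density).

    Chain length is the total number of target-entity mentions;
    density divides it by the number of sentences.
    """
    chain_length = sum(mentions_per_sentence)
    return chain_length, chain_length / len(mentions_per_sentence)


# Hypothetical resolver output: target-entity mentions in each of 4 sentences.
print(coreference_stats([1, 0, 2, 1]))
```

Dividing by sentence count is what makes a 5-sentence and a 20-sentence narrative comparable: a raw chain length would favor the longer text regardless of how entity-centered it is.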

We also conduct a qualitative analysis to check whether human judgment matches the scores. A single human annotator performed this analysis without specific guidelines, relying on personal judgment to evaluate the generated narratives.

We found that models were inconsistent across evaluation dimensions. The Gemma3-27B model performed well only on Narrative Quality, while the Mistral-7B model performed well only on Discourse Coherence. Qwen3 family models were consistently strong in all three dimensions. The qualitative review revealed that Mistral-7B, Llama3.3-70B, and Qwen3-8B frequently did not include the original sentence verbatim, while GPT-OSS-20B and Gemma3-27B produced repetitive narrative patterns that reduced contextual variety. Based on the combined quantitative and qualitative assessment, we selected Qwen3-32B, which achieved the best balance of Narrative Quality, Discourse Coherence, and Coreference Density. Figures detailing all results are in Appendix [B.1](https://arxiv.org/html/2606.27598#A2.SS1).

Prompt Design. Using Qwen3-32B, we systematically varied two prompt dimensions, evaluating each on the development set using the same metrics described above. (1) Number of characters: We tested prompts specifying 2, 3, or any number of entities. Unconstrained prompts yielded the most coherent narratives, as fixed character counts introduced unnecessary complexity. (2) Narrative length: We tested lengths of 5, 10, 15, and 20 sentences. Narratives of 10 sentences offered the best trade-off between coherence and coreference density, with longer narratives suffering from grammatical and consistency degradation. The final prompt combining all optimal settings is provided in Appendix [A.6](https://arxiv.org/html/2606.27598#A1.SS6). Detailed results for all design dimensions are included in Appendix [B.2](https://arxiv.org/html/2606.27598#A2.SS2).

Final Dataset Generation. Using the final prompt and Qwen3-32B, we generated narratives for all 5,994 instances in the UFET crowd-annotated set. Each narrative is a 10-sentence passage that embeds the original sentence verbatim, with no constraint on the number of characters. We generate two variants: Narrative-UFET-Change, in which the prompt instructs the generator to shift the type of the target entity across the narrative, and Narrative-UFET-Maintain, in which the prompt instructs it to hold the type constant. Crucially, the gold types from UFET are never shown to the generator. The prompt specifies only whether the type should change or remain stable, not which type to use. This ensures that any improvements observed under Narrative-UFET reflect richer discourse context rather than direct exposure to the labels.
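
The two structural constraints on each generated narrative (verbatim inclusion of the original sentence and a length of 10 sentences) lend themselves to a simple automatic check. The sketch below is an assumption about how such a check could look, not the authors' tooling: the function names are hypothetical, and the regex-based sentence splitter is deliberately naive and will miscount text containing abbreviations or quoted punctuation.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def check_narrative(narrative, original_sentence, expected_sentences=10):
    """Check the two structural constraints on a generated narrative."""
    return {
        # The original UFET sentence must appear unchanged.
        "verbatim": original_sentence in narrative,
        # The narrative should have the requested number of sentences.
        "sentence_count_ok": len(split_sentences(narrative)) == expected_sentences,
    }


# Toy 3-sentence example; real narratives would use expected_sentences=10.
toy = "Maya opened the report. He wrote that the stock was overvalued. She disagreed."
print(check_narrative(toy, "He wrote that the stock was overvalued.", expected_sentences=3))
```

A check of this kind would flag the failure mode reported during model selection, where some generators paraphrased the original sentence instead of embedding it verbatim.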

### 3.2 Human Validation

To verify narrative quality beyond automated metrics, we conducted a human evaluation on 100 randomly sampled narratives from the test set for both Narrative-UFET-Change and Narrative-UFET-Maintain. Four annotators took part, two per dataset. Annotators rated each narrative on a 5-point Likert scale across six dimensions: grammar, creativity, consistency, plot, context-to-story coherence, and story-internal coherence. Rubric definitions are provided in Appendix [C.1](https://arxiv.org/html/2606.27598#A3.SS1). A pilot study on five examples confirmed calibration, with annotators agreeing or differing by at most 1 point on all dimensions.

We report inter-annotator agreement using Gwet’s AC2 (Gwet, [2014](https://arxiv.org/html/2606.27598#bib.bib10)). We chose this measure because all annotators' scores were heavily concentrated at the high end of the scale (every annotator gave a majority of scores of 3 or higher on all dimensions). Agreement was substantial to almost perfect on five of six dimensions: grammar (AC2 = 0.80, 0.84), context-to-story coherence (AC2 = 0.86, 0.84), story-internal coherence (AC2 = 0.87, 0.81), consistency (AC2 = 0.89, 0.78), and plot (AC2 = 0.86, 0.71). Creativity showed substantial agreement on Narrative-UFET-Change (AC2 = 0.7365) and moderate agreement on Narrative-UFET-Maintain (AC2 = 0.45), reflecting its inherently subjective nature. Average scores across both annotators ranged from 3.56 to 4.80, indicating that the generated narratives are of consistently high quality. Per-dimension scores and annotator breakdowns are in Appendix [C.2](https://arxiv.org/html/2606.27598#A3.SS2).
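
For readers unfamiliar with the statistic, the following is a minimal sketch of Gwet's AC2 for two raters on an ordinal scale. It makes several assumptions that the text does not confirm: linear ordinal weights (the weighting scheme used in the paper is not stated), exactly two raters, and no missing ratings. The function name is hypothetical. The coefficient is (p_a − p_e) / (1 − p_e), where p_a is the weighted observed agreement and the chance term p_e is built from the average category prevalence of the two raters.

```python
def gwet_ac2(ratings_a, ratings_b, categories):
    """Gwet's AC2 for two raters with linear ordinal weights (an assumption)."""
    q = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(ratings_a)

    # Linear weights: 1 on the diagonal, shrinking with category distance.
    w = [[1 - abs(j - k) / (q - 1) for k in range(q)] for j in range(q)]

    # Joint proportion table: p[j][k] = share of items rated j by A and k by B.
    p = [[0.0] * q for _ in range(q)]
    for a, b in zip(ratings_a, ratings_b):
        p[index[a]][index[b]] += 1 / n

    # Weighted observed agreement.
    pa = sum(w[j][k] * p[j][k] for j in range(q) for k in range(q))

    # Average prevalence of each category across the two raters.
    pi = [(sum(p[k]) + sum(p[j][k] for j in range(q))) / 2 for k in range(q)]

    # Chance agreement: T_w / (q (q - 1)) * sum_k pi_k (1 - pi_k).
    t_w = sum(sum(row) for row in w)
    pe = t_w / (q * (q - 1)) * sum(x * (1 - x) for x in pi)

    return (pa - pe) / (1 - pe)


# Heavily skewed toy ratings on a 5-point scale: one adjacent disagreement.
a = [5] * 10
b = [5] * 9 + [4]
print(round(gwet_ac2(a, b, [1, 2, 3, 4, 5]), 3))
```

In this skewed toy case the chance term stays small, so the coefficient remains close to the raw agreement. Kappa-style coefficients tend to drop sharply under the same skew, which is the usual motivation for preferring AC2 when ratings cluster at one end of the scale.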

## 4 Experiments

We evaluate whether Narrative-UFET improves entity typing performance with respect to sentence-level UFET, particularly for long-tail entities. Following Deshmukh et al. ([2025](https://arxiv.org/html/2606.27598#bib.bib4)), we partition the UFET test set into four bins. Details on the bin split and internet search hits are shown in Appendix [D.1](https://arxiv.org/html/2606.27598#A4.SS1). We evaluate using both masked language model (MLM) and causal language model (CLM) approaches. All scoring is done following the UFET guidelines (Choi et al., [2018](https://arxiv.org/html/2606.27598#bib.bib2)).
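
UFET results are commonly reported as macro-averaged precision, recall, and F1 over examples. The sketch below is one reading of that scheme, assumed here rather than taken from the paper's appendices: precision is averaged over examples with at least one prediction, recall over examples with at least one gold type, and F1 is the harmonic mean of the two averages. The bin labels and function names are hypothetical; the actual four-bin split is defined in Appendix D.1.

```python
def macro_prf(predictions, golds):
    """Macro P/R/F1 over aligned lists of predicted and gold type sets."""
    precisions = [len(p & g) / len(p) for p, g in zip(predictions, golds) if p]
    recalls = [len(p & g) / len(g) for p, g in zip(predictions, golds) if g]
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_by_bin(predictions, golds, bins):
    """Compute macro F1 separately for each bin label."""
    scores = {}
    for label in sorted(set(bins)):
        idx = [i for i, b in enumerate(bins) if b == label]
        scores[label] = macro_prf([predictions[i] for i in idx],
                                  [golds[i] for i in idx])[2]
    return scores


preds = [{"person", "analyst"}, {"company"}]
golds = [{"person", "writer", "analyst"}, {"organization", "company"}]
print(macro_prf(preds, golds))
print(f1_by_bin(preds, golds, ["head", "tail"]))
```

Scoring each bin separately is what allows long-tail behavior to be examined at all: a single aggregate F1 would be dominated by frequent entities.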

Narrative context improves typing performance. We adopt the MLM and CLM setups of Deshmukh et al. ([2025](https://arxiv.org/html/2606.27598#bib.bib4)); details are in Appendices [D.2](https://arxiv.org/html/2606.27598#A4.SS2) and [D.3](https://arxiv.org/html/2606.27598#A4.SS3). For CLMs, we run each condition five times and report the mean to account for model variability. Figure [1](https://arxiv.org/html/2606.27598#S4.F1) compares performance between the sentence-level baseline and Narrative-UFET across both setups. Narrative-UFET yields consistent improvements across all bins, with CLMs reaching higher overall performance than MLMs, and Qwen showing a larger relative gain. Across both model types, Narrative-UFET-Change substantially outperforms Narrative-UFET-Maintain, suggesting that changing the type across the narrative produces a richer typing signal.

Synthetic narratives outperform real context. To check that our synthetic narratives capture properties of real discourse, we run both MLM and CLM evaluations using the original OntoNotes 5.0 contexts (Gillick et al., [2014a](https://arxiv.org/html/2606.27598#bib.bib7)). Figure [1](https://arxiv.org/html/2606.27598#S4.F1) shows that Narrative-UFET-Change outperforms OntoNotes context across both model types, while OntoNotes performance sits between Standard-UFET and Narrative-UFET-Maintain. This indicates that the synthetic narratives preserve properties of naturally occurring context, while the controlled type-shift design in Change surfaces typing signal that real text leaves implicit.

Results for additional model families and sizes, along with additional analyses, are in Appendix [D.4](https://arxiv.org/html/2606.27598#A4.SS4).

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph1_all_together.png)

Figure 1: Type F1 Scores Across UFET Bins: MLM and CLM performance for Narrative-UFET, sentence-level, and OntoNotes 5.0 context.

## 5 Conclusion and Future Work

We presented Narrative-UFET, a controlled narrative-level extension of UFET in which each entity mention is paired with an automatically generated narrative built around it. Our hypothesis was that synthesizing narratives lets us isolate the effect of specific discourse properties. To study this, we constructed two variants that differ in whether the entity’s type is held constant (Maintain) or shifts (Change) across the narrative. We found that narrative context yields consistent gains on long-tail entities over sentence-level baselines, with the Change variant providing the stronger typing signal and outperforming naturally occurring contexts, suggesting that controlled construction can surface signal that real text leaves implicit.

The type\-consistency contrast studied here is one of many discourse properties that controlled narrative construction makes accessible\. Coreference density, interactions between co\-occurring entities, the granularity of type shifts, and the distribution of typing evidence across sentences are all dimensions that Narrative\-UFET could be extended along, and each may explain a different part of the gap between current models and the typing signal available in discourse\. We see this as the main direction opened by our work: not just longer context for entity typing, but a systematic investigation of which discourse properties carry typing signal and how models can exploit them\.

## Limitations

We identify six main limitations. (1) Our case study varies only one property of the narrative: whether the entity’s type is held constant or shifts. Other discourse properties (e.g., coreference density, inter-entity interactions, the distribution of typing evidence across sentences) are likely to carry typing signal as well, and we leave their systematic study to future work. (2) While we validate narrative quality through automated and human evaluation, and compare against naturally occurring contexts from OntoNotes, synthetic narratives may deviate from natural discourse distributions in ways that our evaluation does not capture. Extending Narrative-UFET with retrieved or human-authored narratives is an important direction. (3) We attribute the stronger performance of the Change variant to richer typing signal from type variation across discourse, but we do not control for confounding factors such as token count, lexical diversity, or the number of distinct types mentioned. The relative contribution of these factors remains open. (4) Our experiments cover one MLM (BERT) and two CLMs (Llama3.3-70B, Qwen3-32B) on a single base typing dataset (UFET, English, crowd-annotated). Findings may not transfer to other typing schemes, languages, or model families. (5) For computational reasons, all models in Section [3](https://arxiv.org/html/2606.27598#S3) were run in quantized form. Quantization can affect both narrative generation quality and typing predictions. We did not systematically measure these effects. (6) Our qualitative review during model selection was conducted by a single annotator without formal guidelines. While the final dataset is also validated through inter-annotator human evaluation with formal rubrics, the model-selection stage relied on lighter-weight judgment.

## Ethical Considerations

To the best of our knowledge, this work does not raise any ethical concerns\. All information required to replicate our experiments is provided in the paper and the appendices\. We use only open\-source language models with publicly available and versioned weights\. Importantly, we ensure separation between the models used for narrative generation, narrative evaluation, and entity typing evaluation, avoiding circular dependencies that could inflate results\. Additional plots and implementation details are provided in the appendix\.

## References

- Chhun et al. (2022) Cyril Chhun, Pierre Colombo, Fabian M. Suchanek, and Chloé Clavel. 2022. [Of human criteria and automatic metrics: A benchmark of the evaluation of story generation](https://aclanthology.org/2022.coling-1.509/). In *Proceedings of the 29th International Conference on Computational Linguistics*, pages 5794–5836, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Choi et al. (2018) Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. [Ultra-fine entity typing](https://doi.org/10.18653/v1/P18-1009). In *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 87–96, Melbourne, Australia. Association for Computational Linguistics.
- Dai et al. (2021) Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. [Ultra-fine entity typing with weak supervision from a masked language model](https://doi.org/10.18653/v1/2021.acl-long.141). In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)*, pages 1790–1799, Online. Association for Computational Linguistics.
- Deshmukh et al. (2025) Advait Deshmukh, Ashwin Umadi, Dananjay Srinivas, and Maria Leonor Pacheco. 2025. [All entities are not created equal: Examining the long tail for ultra-fine entity typing](https://doi.org/10.18653/v1/2025.starsem-1.15). In *Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (\*SEM 2025)*, pages 189–201, Suzhou, China. Association for Computational Linguistics.
- Ding et al. (2021) Ning Ding, Guangwei Xu, Yulin Chen, Xiaobin Wang, Xu Han, Pengjun Xie, Haitao Zheng, and Zhiyuan Liu. 2021. [Few-NERD: A few-shot named entity recognition dataset](https://doi.org/10.18653/v1/2021.acl-long.248). In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)*, pages 3198–3213, Online. Association for Computational Linguistics.
- Eldan and Li (2023) Ronen Eldan and Yuanzhi Li. 2023. [TinyStories: How small can language models be and still speak coherent English?](https://arxiv.org/abs/2305.07759) *Preprint*, arXiv:2305.07759.
- Gillick et al. (2014a) Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014a. [Context-dependent fine-grained entity type tagging](https://arxiv.org/abs/1412.1820). *CoRR*, abs/1412.1820.
- Gillick et al. (2014b) Daniel Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh. 2014b. [Context-dependent fine-grained entity type tagging](https://api.semanticscholar.org/CorpusID:9836000). *ArXiv*, abs/1412.1820.
- Goldfarb-Tarrant et al. (2020) Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. 2020. [Content planning for neural story generation with aristotelian rescoring](https://doi.org/10.18653/v1/2020.emnlp-main.351). In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pages 4319–4338, Online. Association for Computational Linguistics.
- Gwet (2014) K.L. Gwet. 2014. [*Handbook of Inter-Rater Reliability, 4th Edition: The Definitive Guide to Measuring The Extent of Agreement Among Raters*](https://books.google.com/books?id=fac9BQAAQBAJ). Advanced Analytics, LLC.
- Harel-Canada et al. (2024) Fabrice Y Harel-Canada, Hanyu Zhou, Sreya Muppalla, Zeynep Senahan Yildiz, Miryung Kim, Amit Sahai, and Nanyun Peng. 2024. [Measuring psychological depth in language models](https://doi.org/10.18653/v1/2024.emnlp-main.953). In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 17162–17196, Miami, Florida, USA. Association for Computational Linguistics.
- Koch et al. (2014) Mitchell Koch, John Gilmer, Stephen Soderland, and Daniel S. Weld. 2014. [Type-aware distantly supervised relation extraction with linked arguments](https://doi.org/10.3115/v1/D14-1203). In *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pages 1891–1901, Doha, Qatar. Association for Computational Linguistics.
- Li et al. (2024a) Jinyuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, and Gang Pan. 2024a. [LLMs as bridges: Reformulating grounded multimodal named entity recognition](https://doi.org/10.18653/v1/2024.findings-acl.76). In *Findings of the Association for Computational Linguistics: ACL 2024*, pages 1302–1318, Bangkok, Thailand. Association for Computational Linguistics.
- Li et al. (2024b) Muzhi Li, Minda Hu, Irwin King, and Ho-fung Leung. 2024b. [The integration of semantic and structural knowledge in knowledge graph entity typing](https://doi.org/10.18653/v1/2024.naacl-long.369). In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)*, pages 6625–6638, Mexico City, Mexico. Association for Computational Linguistics.
- Li et al. (2023) Na Li, Zied Bouraoui, and Steven Schockaert. 2023. [Ultra-fine entity typing with prior knowledge about labels: A simple clustering based strategy](https://doi.org/10.18653/v1/2023.findings-emnlp.786). In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 11744–11756, Singapore. Association for Computational Linguistics.
- Ling et al. (2015) Xiao Ling, Sameer Singh, and Daniel S. Weld. 2015. [Design challenges for entity linking](https://doi.org/10.1162/tacl_a_00141). *Transactions of the Association for Computational Linguistics*, 3:315–328.
- Ling and Weld (2021) Xiao Ling and Daniel Weld. 2021. [Fine-grained entity recognition](https://doi.org/10.1609/aaai.v26i1.8122). *Proceedings of the AAAI Conference on Artificial Intelligence*, 26(1):94–100.
- Naismith et al. (2023) Ben Naismith, Phoebe Mulcaire, and Jill Burstein. 2023. [Automated evaluation of written discourse coherence using GPT-4](https://doi.org/10.18653/v1/2023.bea-1.32). In *Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)*, pages 394–403, Toronto, Canada. Association for Computational Linguistics.
- Onoe and Durrett (2020) Yasumasa Onoe and Greg Durrett. 2020. [Interpretable entity representations through large-scale typing](https://doi.org/10.18653/v1/2020.findings-emnlp.54). In *Findings of the Association for Computational Linguistics: EMNLP 2020*, pages 612–624, Online. Association for Computational Linguistics.
- Pan et al. (2022) Weiran Pan, Wei Wei, and Feida Zhu. 2022. [Automatic noisy label correction for fine-grained entity typing](https://doi.org/10.24963/ijcai.2022/599). In *Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22*, pages 4317–4323. International Joint Conferences on Artificial Intelligence Organization. Main Track.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. [Exploring the limits of transfer learning with a unified text-to-text transformer](http://jmlr.org/papers/v21/20-074.html). *Journal of Machine Learning Research*, 21(140):1–67.
- Riedel et al. (2010) Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. [Modeling relations and their mentions without labeled text](https://api.semanticscholar.org/CorpusID:2386383). In *ECML/PKDD*.
- Tang et al. (2024) Xiaoyi Tang, Hongwei Chen, Daoyu Lin, and Kexin Li. 2024. [Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments](https://doi.org/10.1016/j.heliyon.2024.e34262). *Heliyon*, 10(14):e34262.
- Tian et al. (2024) Yufei Tian, Tenghao Huang, Miri Liu, Derek Jiang, Alexander Spangher, Muhao Chen, Jonathan May, and Nanyun Peng. 2024. [Are large language models capable of generating human-level narratives?](https://doi.org/10.18653/v1/2024.emnlp-main.978) In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 17659–17681, Miami, Florida, USA. Association for Computational Linguistics.
- Wang et al. (2024) Ziqi Wang, Chen Zhu, Zhi Zheng, Xinhang Li, Tong Xu, Yongyi He, Qi Liu, Ying Yu, and Enhong Chen. 2024. [Granular entity mapper: Advancing fine-grained multimodal named entity recognition and grounding](https://doi.org/10.18653/v1/2024.findings-emnlp.183). In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 3211–3226, Miami, Florida, USA. Association for Computational Linguistics.
- Weber et al\. \(2024\)Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, and Ce Zhang\. 2024\.[Redpajama: an open dataset for training large language models](https://arxiv.org/abs/2411.12372)\.*Preprint*, arXiv:2411\.12372\.
- Yang et al\. \(2022\)Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein\. 2022\.[Re3: Generating longer stories with recursive reprompting and revision](https://doi.org/10.18653/v1/2022.emnlp-main.296)\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 4393–4479, Abu Dhabi, United Arab Emirates\. Association for Computational Linguistics\.
- Zhu et al\. \(2015\)Yukun Zhu, Ryan Kiros, Richard S\. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler\. 2015\.[Aligning books and movies: Towards story\-like visual explanations by watching movies and reading books](https://arxiv.org/abs/1506.06724)\.*CoRR*, abs/1506\.06724\.

## Appendix A Experimental Details for Narrative Generation

All experimental outputs in this section were generated using models accessed through Ollama, with the default inference settings that Ollama provides for each model\. All models were run in quantized form: GPT\-OSS\-20B used MXFP4 quantization, while Llama3\.3\-70B, Gemma3\-27B, Qwen3\-8B, Qwen3\-14B, Qwen3\-32B, and Mistral\-7B used Q4\_K\_M quantization\. Each model kept its default temperature setting\. We list all prompts used in narrative generation, including those for model testing, narrative evaluation, prompt testing, and final narrative generation\. Prompt variants are shown in a consolidated form, as most variants differ only minimally\. The placement of prompt variants reflects where they are input to the model; in our experiments, however, each variant was run individually\.

### A\.1 Model Testing

#### A\.1\.1 Prompt

To select the best model for our final narrative generations, the following prompt was used\.

You are writing a short, coherent story \(<15 sentences\) about the entity: "\{entity\}"\. Rules: \- The sentence below MUST appear word\-for\-word, exactly as written in the narrative\. \- You cannot paraphrase, reword, or change punctuation in that sentence\. \- Wrap every mention of the entity "\{entity\}" in <e\></e\> tags\. \- The story should naturally include the sentence and expand around it\. MANDATORY SENTENCE \(must appear exactly as shown\): "\{entity\_sentence\}"
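The two mechanical rules in this prompt (the mandatory sentence must appear verbatim, and entity mentions must be wrapped in `<e></e>` tags) can be checked automatically. The following is a minimal sketch of one such check, not the authors' pipeline; the function name `check_narrative` and the toy story are hypothetical.

```python
import re


def check_narrative(narrative: str, entity: str, entity_sentence: str) -> dict:
    """Check two of the prompt's rules on a generated narrative (illustrative only)."""
    # Strip the <e></e> tags so the mandatory sentence can be matched verbatim.
    plain = re.sub(r"</?e>", "", narrative)
    # Collect tagged spans and count those that match the entity, ignoring case.
    tagged = re.findall(r"<e>(.*?)</e>", narrative, flags=re.DOTALL)
    n_tagged = sum(1 for span in tagged if span.strip().lower() == entity.lower())
    return {
        "sentence_verbatim": entity_sentence in plain,
        "tagged_mentions": n_tagged,
    }


# Hypothetical toy narrative, not taken from the dataset.
story = "<e>She</e> said the deal was risky. Later, <e>she</e> left the room."
report = check_narrative(story, "She", "She said the deal was risky.")
paraphrased = check_narrative(
    "<e>She</e> said the deal was very risky.", "She", "She said the deal was risky."
)
```

An exact substring test is deliberately strict: any change in wording or punctuation inside the mandatory sentence makes `sentence_verbatim` false.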

#### A\.1\.2 Generation Example

Given the entity “She” and the sentence “She said the Broad \- Burkle offer takes on a ‘shocking’ amount of debt and the Chandlers, despite their bid, do not really run the business”, a generated narrative would be:

“The boardroom hummed with tension as she adjusted her glasses, her voice cutting through the murmurs\. She said the Broad \- Burkle offer takes on a “shocking” amount of debt and the Chandlers, despite their bid, do not really want to run the business\. A junior analyst choked on her coffee, while the CEO’s jaw tightened\. She leaned forward, charts spread across the table, each number a dagger to the room’s fragile optimism\. “The Chandlers are bidding to inflate the price,” she added, “not to steward the company\.” The chairperson sighed, rubbing his temples\. “And your recommendation?” she paused, then said, “We walk\.” The room erupted in chaos\. Later, alone in the elevator, she replayed the meeting, wondering if truth could ever feel less like a weapon\. By dawn, the Chandlers had withdrawn\. The Broad\-Burkle deal collapsed\. She never returned to that boardroom, but the quote lingered, etched into the company’s lore\.”

### A\.2 Prompt for TinyStories Evaluation Framework

The following prompt is a modified version of the TinyStories evaluation framework, which grades a narrative on ‘Grammar,’ ‘Creativity,’ ‘Consistency,’ and ‘Plot\.’ Due to model constraints, GPT\-4 was replaced with GPT\-OSS\-20B\.

Your task is to evaluate how well the narrative was written\. \- The narrative is: \{narrative\} \- The story should contain the following sentence: \{entity\_sentence\}\. Grade the narrative in terms of grammar, creativity, consistency, and whether the plot makes sense\. Use the following grading format \(exactly\): \- Grammar: x/10, Creativity: x/10, Consistency: x/10, Plot: x/10
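Because the prompt fixes the grading format, the evaluator's reply can be turned into numeric scores with a small parser. The sketch below is one possible way to do this, under the assumption that the reply follows the requested format; the function name `parse_scores` and the sample reply are hypothetical.

```python
import re

# The four dimensions named in the grading format.
DIMENSIONS = ("Grammar", "Creativity", "Consistency", "Plot")


def parse_scores(reply: str) -> dict:
    """Extract 'Dimension: x/10' scores from an evaluator reply."""
    scores = {}
    for dim in DIMENSIONS:
        match = re.search(rf"{dim}:\s*(\d+(?:\.\d+)?)\s*/\s*10", reply)
        if match is None:
            # Fail loudly rather than silently averaging over missing scores.
            raise ValueError(f"missing score for {dim}")
        scores[dim] = float(match.group(1))
    return scores


# Hypothetical reply in the requested format.
reply = "- Grammar: 8/10, Creativity: 6/10, Consistency: 9/10, Plot: 7/10"
scores = parse_scores(reply)
mean_score = sum(scores.values()) / len(scores)
```

Raising on a missing dimension keeps malformed replies out of the per-model means.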

### A\.3 Discourse Coherence Implementation Details

Discourse coherence is measured using a bert\-base\-uncased model, which maps each sentence to a sentence\-level embedding vector\. Cosine similarity is then computed on these vectors, where a value of 1 indicates high relatedness and a value of 0 indicates low relatedness\.
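As a rough illustration of the cosine-similarity step, the sketch below scores a narrative from per-sentence vectors. It assumes the vectors have already been produced by the encoder and that the score is the mean similarity between consecutive sentences; the text does not state how pairs are chosen or aggregated, so that choice, the function names, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coherence(sentence_vectors):
    """Mean cosine similarity between consecutive sentence vectors (assumed aggregation)."""
    pairs = list(zip(sentence_vectors, sentence_vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d vectors standing in for sentence embeddings.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = coherence(toy_vectors)
```

In the toy input, the first pair is identical (similarity 1) and the second pair is orthogonal (similarity 0), so the mean lands halfway between the two extremes described above.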

### A\.4 Number of Characters

To study the effect of character constraints on narrative quality, we experiment with prompts that have a specified number of characters \(2 or 3\) as well as a setting where the model determines the number of characters\.

You are writing a short, coherent story \(<15 sentences\) about the entity "\{entity\}"\. Rules: \- The sentence below MUST appear word\-for\-word, exactly as written in the narrative\. \- You cannot paraphrase, reword, or change punctuation in that sentence\. \- Wrap every mention of the entity "entity" in <e\></e\> tags\. \- The story should naturally include that sentence and expand around it\. Character Count Variants: \- 2 Characters: The story must have 2 characters \- 3 Characters: The story must have 3 characters \- Any Number of Characters: The number of characters in the story is up to you\. MANDATORY SENTENCE \(must appear exactly as shown\): "\{entity\_sentence\}"

### A\.5 Narrative Length

Additionally, we experiment with different narrative lengths\. The prompt is displayed below\.

Length Variants: 5 / 10 / 15 / 20 sentences\. You are writing a short, coherent story about the entity "\{entity\}" with the specified length\. Rules: \- The sentence below MUST appear word\-for\-word, exactly as written in the narrative\. \- You cannot paraphrase, reword, or change punctuation in that sentence\. \- Wrap every mention of the entity "\{entity\}" in <e\></e\> tags\. \- The story should naturally include that sentence and expand around it\. \- The number of characters in the story is up to you\. MANDATORY SENTENCE \(must appear exactly as shown\): "\{entity\_sentence\}"

### A\.6 Final Narrative

This prompt contains the final instructions used to generate the two datasets\. We name dataset 1 Narrative\-UFET\-Change and dataset 2 Narrative\-UFET\-Maintain\.

You are writing a short, coherent story that is 10 sentences about the entity "\{entity\}"\. Rules: \- The sentence below MUST appear word\-for\-word, exactly as written in the narrative\. \- You cannot paraphrase, reword, or change punctuation in that sentence\. \- Wrap every mention of the entity "\{entity\}" in <e\></e\> tags\. \- The story should naturally include that sentence and expand around it\. \- The number of characters in the story is up to you\. Type Variants: \- Change Type: You must change the type of the entity throughout the story\. The number of changes is up to you\. \- Maintain Type: You must choose one type which represents the entity throughout the story\. MANDATORY SENTENCE \(must appear exactly as shown\): "\{entity\_sentence\}"
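Since each variant was run individually, the consolidated prompt above expands into two concrete prompts that differ only in the type rule. The sketch below shows one way to assemble them. The rule wording is copied from the prompt, but the line breaks, the position of the type rule among the other rules, the function name `build_final_prompt`, and the sample inputs are assumptions.

```python
# Type-variant rules, copied from the consolidated prompt.
TYPE_RULES = {
    "change": (
        "You must change the type of the entity throughout the story. "
        "The number of changes is up to you."
    ),
    "maintain": (
        "You must choose one type which represents the entity "
        "throughout the story."
    ),
}


def build_final_prompt(entity: str, entity_sentence: str, variant: str) -> str:
    """Assemble the prompt for one variant; the layout is an assumption."""
    lines = [
        "You are writing a short, coherent story that is 10 sentences "
        f'about the entity "{entity}".',
        "Rules:",
        "- The sentence below MUST appear word-for-word, exactly as written in the narrative.",
        "- You cannot paraphrase, reword, or change punctuation in that sentence.",
        f'- Wrap every mention of the entity "{entity}" in <e></e> tags.',
        "- The story should naturally include that sentence and expand around it.",
        "- The number of characters in the story is up to you.",
        f"- {TYPE_RULES[variant]}",
        f'MANDATORY SENTENCE (must appear exactly as shown): "{entity_sentence}"',
    ]
    return "\n".join(lines)


# Hypothetical inputs for illustration.
change_prompt = build_final_prompt("She", "She said the offer was risky.", "change")
maintain_prompt = build_final_prompt("She", "She said the offer was risky.", "maintain")
```

Keeping everything except the type rule identical is what makes the two datasets a paired comparison.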

## Appendix B Evaluation of Narrative Generation Pipeline

### B\.1 Model Selection

Figures [2](https://arxiv.org/html/2606.27598#A2.F2)\-[4](https://arxiv.org/html/2606.27598#A2.F4) present all dimensions of model selection results, including narrative quality, discourse coherence, and coreference chain length\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/first_narrative_generations_results/tinystories_results_first_nar_gen.png) Figure 2: Mean TinyStories narrative quality scores across models for grammar, creativity, consistency, and plot for model testing\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/first_narrative_generations_results/coherence_results_first_nar_gen.png) Figure 3: Mean coherence scores across models for model testing\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/first_narrative_generations_results/coreference_results_first_nar_gen.png) Figure 4: Mean coreference scores across models for model testing\.

### B\.2 Prompt Design

#### B\.2\.1 Number of Characters

Figures [5](https://arxiv.org/html/2606.27598#A2.F5)\-[7](https://arxiv.org/html/2606.27598#A2.F7) show all evaluation dimensions of the number\-of\-characters prompt results\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/prompt_testing/character_testing/tinystories_characters.png) Figure 5: Mean TinyStories narrative quality scores across models for grammar, creativity, consistency, and plot for number of character testing\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/prompt_testing/character_testing/coherence_character.png) Figure 6: Mean coherence scores for character testing\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/prompt_testing/character_testing/coreference_characters.png) Figure 7: Mean coreference scores for character testing\.

#### B\.2\.2 Narrative Length

Figures [8](https://arxiv.org/html/2606.27598#A2.F8)\-[10](https://arxiv.org/html/2606.27598#A2.F10) present all dimensions of the narrative\-length results\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/prompt_testing/length_testing/tinystories_length.png) Figure 8: Mean TinyStories narrative quality scores across models for grammar, creativity, consistency, and plot for narrative lengths\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/prompt_testing/length_testing/coherence_narrative.png) Figure 9: Mean coherence scores for narrative lengths\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/prompt_testing/length_testing/coreference_narrative.png) Figure 10: Mean coreference density scores for different narrative lengths\.

## Appendix C Human Evaluation of Narratives

This section details the human evaluation setup and results for Narrative\-UFET\.

### C\.1 Rubric Dimensions and Descriptions

All rubric dimensions and descriptions are shown in Table [1](https://arxiv.org/html/2606.27598#A3.T1)\.

Table 1: Dimensions used for human evaluation of narrative quality\. Each dimension is rated on a 5\-point Likert scale \(1 = strongly disagree, 5 = strongly agree\)\.

### C\.2 Annotator Results

Table [2](https://arxiv.org/html/2606.27598#A3.T2) presents the average annotator scores and Gwet \(AC2\) agreement results for Narrative\-UFET\-Change, and Table [3](https://arxiv.org/html/2606.27598#A3.T3) presents the same for Narrative\-UFET\-Maintain\.

Table 2: Annotator & Gwet \(AC2\) Results for Narrative\-UFET\-Change

Table 3: Annotator & Gwet \(AC2\) Results for Narrative\-UFET\-Maintain

## Appendix D Entity\-Typing

This section shows the experimental design and results of MLMs and CLMs\.

### D\.1 Bin Split Details

Following Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\), we partition the UFET test set into four bins based on entity frequency\. Entity frequency scores are calculated for all target UFET entities using the Google Custom Search API, where internet search hits are used as a proxy for how often an entity appears in the PLM pretraining data\. Entities are then grouped into the four bins based on quartiles\. Bin 1 contains the rarest entities and Bin 4 the most frequent\. Table [4](https://arxiv.org/html/2606.27598#A4.T4) shows this bin split\.
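The quartile grouping can be sketched as a rank-based split over search-hit counts. This is an illustrative approximation rather than the exact procedure of Deshmukh et al.: the function name `assign_bins` and the hit counts are hypothetical, and ties are broken by sort order here, which the source does not specify.

```python
def assign_bins(hit_counts: dict, n_bins: int = 4) -> dict:
    """Map each entity to a bin in 1..n_bins by rank of its hit count (rarest first)."""
    ranked = sorted(hit_counts, key=hit_counts.get)
    total = len(ranked)
    # Integer arithmetic splits the ranking into n_bins near-equal groups.
    return {entity: rank * n_bins // total + 1 for rank, entity in enumerate(ranked)}


# Hypothetical search-hit counts for eight entities.
hits = {"a": 10, "b": 5000, "c": 120, "d": 9, "e": 70000, "f": 800, "g": 33, "h": 2500}
bins = assign_bins(hits)
```

With eight entities, each bin receives two: the two rarest land in Bin 1 and the two most frequent in Bin 4. Because the split is over entities, the number of test examples per bin need not be equal.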

To make sure that internet search hits provide a good proxy for estimating relative entity frequency, Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\) validate the proxy by correlating it with real\-world datasets known to be used in PLM pretraining, acknowledging that such disclosures are limited to only a few models\. These datasets include BookCorpus Zhu et al\. \([2015](https://arxiv.org/html/2606.27598#bib.bib28)\), C4 Raffel et al\. \([2020](https://arxiv.org/html/2606.27598#bib.bib21)\), and RedPajama Weber et al\. \([2024](https://arxiv.org/html/2606.27598#bib.bib26)\)\.

Additionally, Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\) recognize that the Internet is constantly evolving and that temporal dynamics could alter the distribution of entities\. For this reason, they performed their analysis with API data capped from 2018 to 2024 and found the results to be consistent over time, with minor changes in correlation coefficients\. They also recognize that entity classification can change across large time periods\. Comparing the correlation between the LM predictions for an entity and the API data capped at 2018 and at 2024, they find that results are largely consistent across these two time periods: only 39 entities from the test set \(<2%\) changed their bin classifications\. For our main results, we rank entities using the 2024 results\.

Table 4: Distribution of entities across UFET test bins Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\)

### D\.2 MLM Hearst Pattern Details

In order to test Narrative\-UFET via MLMs, three Hearst\-like patterns are used, where \[MASK\] tokens are appended to the entity mention\. In this setup, the first entity mentioned in the narrative is replaced by a \[MASK\], and the following three prompting strategies are used: ‘\[MASK\] such as entity,’ ‘entity and any other \[MASK\],’ and ‘entity and some other \[MASK\]’ Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\)\. Additionally, we use the development set to test how many types the model should generate per narrative; this is represented by the hyperparameter $n$\. Values $n=1$ to $n=10$ are tested\. The test set is then evaluated using only the Hearst\-like patterns and the hyperparameter $n$\.
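To make the pattern construction concrete, the sketch below builds the three masked inputs for a narrative and keeps the top $n$ predicted types. It reflects one reading of the setup, in which the first mention of the entity is rewritten with each pattern; the function names, the toy narrative, and the score dictionary are hypothetical, and no masked language model is actually called.

```python
MASK = "[MASK]"

# The three Hearst-like patterns listed above.
PATTERNS = (
    "{mask} such as {entity}",
    "{entity} and any other {mask}",
    "{entity} and some other {mask}",
)


def hearst_inputs(narrative: str, entity: str) -> list:
    """Rewrite the first mention of the entity with each pattern (assumed reading)."""
    inputs = []
    for pattern in PATTERNS:
        phrase = pattern.format(mask=MASK, entity=entity)
        inputs.append(narrative.replace(entity, phrase, 1))
    return inputs


def top_n_types(scores: dict, n: int) -> list:
    """Keep the n highest-scoring candidate types."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:n]]


# Hypothetical narrative and hypothetical MLM scores for the [MASK] position.
inputs = hearst_inputs("Paris hosted the summit. Paris was crowded.", "Paris")
predicted = top_n_types({"city": 0.41, "place": 0.22, "capital": 0.19, "event": 0.03}, n=3)
```

In practice the scores would come from the MLM's distribution over the \[MASK\] position, and $n$ would be the value selected on the development set.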

### D\.3 CLM Narrative Prompt for Entity Typing

In this experiment, Llama3\.3\-70B and Qwen3\-32B were run in Q4\_K\_M quantized form using Ollama, with the temperature parameter set to 0 for both models Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\)\. We use 15\-shot prompting with examples taken from the development set\. The prompt to generate types auto\-regressively is presented below\.

\# Entity\-Typing Assistant You are a precise entity\-typing assistant\. Given a narrative in which all entity mentions are wrapped in ’<ENT\> \.\.\. </ENT\>’ tags, produce only a JSON object whose single key is "predicted\_types"\. Only predict types for the entity mention wrapped in ’<ENT\> \.\.\. </ENT\>’ tags\. Use the full narrative as context ONLY if it describes that same entity\. Do NOT predict types for the overall narrative topic or other entities\. \#\# Guidelines \- The value must be a JSON array of strings\. \- Aim for HIGH RECALL: include every type that is supported by the text, even if it is broad/redundant\. \- Always include the best broad class when supported Example: \["person", "organization", "place", "location", "object"\]\. \- Remove duplicates, keep each type concise \(ideally a short noun phrase\), and use labels consistent to the examples\. \- If both a broad type and a more specific subtype are valid labels, include BOTH\. Example: \["college", "community\_college"\]\. \- Do not output any keys other than "predicted\_types"\. \#\# Input Format \- ENTITY\_SENTENCE: The sentence that contains the target entity marked with <ENT\> tags \(primary evidence\)\. \- NARRATIVE: The complete narrative with all target entities clearly mapped in <ENT\> tags \(secondary evidence\)\. \- ENTITY\_MENTION: The target entity mention from the narrative\. \#\# Output Format \(ONLY JSON\) \{ "predicted\_types": \["TypeA", "TypeB", "TypeC", \.\.\.\] \}
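The prompt asks for a single JSON object, so the model reply can be post-processed with a small, defensive parser. The sketch below is one possible approach, not the authors' code: the function name `parse_predicted_types` is hypothetical, and lowercasing labels before de-duplication is an added assumption.

```python
import json


def parse_predicted_types(reply: str) -> list:
    """Extract a de-duplicated list of types from a model reply."""
    # Tolerate stray text around the JSON object by slicing to the outer braces.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        return []
    try:
        payload = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    raw = payload.get("predicted_types", [])
    if not isinstance(raw, list):
        return []
    seen, types = set(), []
    for item in raw:
        # Assumption: labels are compared case-insensitively.
        label = str(item).strip().lower()
        if label and label not in seen:
            seen.add(label)
            types.append(label)
    return types


# Hypothetical model replies.
good = parse_predicted_types('Sure: {"predicted_types": ["person", "Person", "politician"]}')
bad = parse_predicted_types("I cannot answer that.")
```

Returning an empty list for malformed replies means such examples contribute no predicted types instead of aborting the evaluation run.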

### D\.4 Entity\-Typing Results

Tables [5](https://arxiv.org/html/2606.27598#A4.T5)\-[16](https://arxiv.org/html/2606.27598#A4.T16) show expanded precision, recall, and F1 scores for the standard and expanded entity\-context representation\. Figures [11](https://arxiv.org/html/2606.27598#A4.F11)\-[16](https://arxiv.org/html/2606.27598#A4.F16) show the F1 scores of all models in varying ways\.

##### Overall Impact of Narrative Context

Across all three models, providing narrative context yields consistent improvements over Standard\-UFET\. BERT \(Tables [5](https://arxiv.org/html/2606.27598#A4.T5)–[7](https://arxiv.org/html/2606.27598#A4.T7)\) improves from 26\.3 to 33\.4 F1 with Narrative\-UFET\-Change, while Llama \(Tables [9](https://arxiv.org/html/2606.27598#A4.T9)–[11](https://arxiv.org/html/2606.27598#A4.T11)\) rises from 42\.5 to 49\.6, and Qwen \(Tables [13](https://arxiv.org/html/2606.27598#A4.T13)–[15](https://arxiv.org/html/2606.27598#A4.T15)\) shows the largest absolute gain, jumping from 29\.6 to 42\.2\. The gains are most pronounced at the ultra\-fine level, where single\-sentence context is least sufficient for this task\. BERT’s ultra\-fine F1 improves from 19\.2 to 25\.0, and Llama’s from 31\.6 to 40\.2\. This supports the view that broader narrative context is especially valuable for resolving fine\-grained distinctions that a single sentence cannot provide\. This contrasts with coarse\-level performance, which was already relatively strong under Standard\-UFET and improves more modestly with Narrative\-UFET\-Maintain\.

##### Narrative\-UFET\-Change vs Narrative\-UFET\-Maintain

Narrative\-UFET\-Change consistently outperforms Narrative\-UFET\-Maintain across all models, suggesting that narratives designed for type changes provide stronger typing signals\. Bin\-level trends show that the hardest examples \(Bin 1\) benefit substantially from narrative context\. For example, Llama’s Bin 1 F1 rises from 36\.6 to 44\.0 under Narrative\-UFET\-Change \(Table [10](https://arxiv.org/html/2606.27598#A4.T10)\), while Bin 4, which is both the easiest and largest subset \(1,095 examples\), already performs well under Standard\-UFET and sees comparatively smaller relative gains\. Qwen’s Standard\-UFET baseline is notably recall\-poor \(22\.8 overall\) despite high precision \(42\.1\), a pattern narrative context largely corrects by recovering recall to 34\.4 under Narrative\-UFET\-Change\. The one notable exception is Qwen’s Bin 4 under Narrative\-UFET\-Maintain, which drops to 28\.5 F1 \(Table [15](https://arxiv.org/html/2606.27598#A4.T15)\), below its Standard\-UFET Bin 4 score of 25\.8 \(Table [13](https://arxiv.org/html/2606.27598#A4.T13)\), suggesting that for easier, high\-volume examples, poorly matched narrative framing can introduce noise rather than signal\.
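The link between Qwen's low recall and its weak baseline F1 can be checked with a short calculation. The sketch below assumes the reported overall F1 is the harmonic mean of the reported overall precision and recall, which the text does not state explicitly; the function name `f1_score` is hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Qwen's Standard-UFET overall precision and recall, as quoted in the text.
qwen_standard_f1 = round(f1_score(42.1, 22.8), 1)
```

Under that assumption, the result rounds to 29.6, matching the Standard\-UFET F1 reported for Qwen in the overall comparison. Because the harmonic mean is pulled toward the smaller of the two values, raising recall moves F1 more than a comparable change in precision would.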

##### Validating Narrative\-UFET with OntoNotes Context

The OntoNotes context supplies the naturally occurring document text surrounding a given UFET sentence\. The results of using this context mostly sit between the Standard\-UFET baseline and the Narrative\-UFET conditions, suggesting that real document context helps but that the structured, entity\-focused narratives provide a stronger signal\. For BERT \(Table [8](https://arxiv.org/html/2606.27598#A4.T8)\), the OntoNotes context yields an overall F1 of 36\.1, which exceeds the Standard\-UFET baseline of 26\.3 but falls short of Narrative\-UFET\-Change’s 33\.4\. Llama \(Table [12](https://arxiv.org/html/2606.27598#A4.T12)\) follows a similar pattern, reaching 41\.7 overall F1 under the additional context compared to 42\.5 on Standard\-UFET and 49\.6 on Narrative\-UFET\-Change, indicating that raw context alone does not consistently outperform the single\-sentence baseline for larger models\. Qwen \(Table [16](https://arxiv.org/html/2606.27598#A4.T16)\) reaches 33\.8 overall F1 with the context, above its weak Standard\-UFET baseline of 29\.6 but well below its 42\.2 under Narrative\-UFET\-Change\. Taken together, these results suggest that context quality and framing matter as much as context quantity, and that narratives explicitly constructed around the entity mention yield the most reliable gains across models and granularity levels\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph2_dataset1.png) Figure 11: F1 Scores Across Bins\. Comparing MLM and CLM performance across Narrative\-UFET\-Change of Narrative\-UFET and the Standard UFET \(sentence\-level\)\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph3_dataset2.png) Figure 12: F1 Scores Across Bins\. Comparing MLM and CLM performance between Narrative\-UFET\-Maintain of Narrative\-UFET and the Standard UFET \(sentence\-level\)\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph7_ontonotes_context.png) Figure 13: F1 Scores Across Bins\. Comparing MLM and CLM performance between original OntoNotes context and the Standard UFET \(sentence\-level\)\. Scored on the UFET type set\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph4_bert_d1_vs_d2.png) Figure 14: F1 Scores Across Bins\. Comparing MLM Bert Model between both datasets of Narrative\-UFET and the Standard UFET \(sentence\-level\)\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph5_llama_d1_vs_d2.png) Figure 15: F1 Scores Across Bins\. Comparing Llama Model between both datasets of Narrative\-UFET and the Standard UFET \(sentence\-level\)\.

![Refer to caption](https://arxiv.org/html/2606.27598v1/images/f1_scores/graph6_qwen_d1_vs_d2.png) Figure 16: F1 Scores Across Bins\. Comparing Qwen Model between both datasets of Narrative\-UFET and the Standard UFET \(sentence\-level\)\.

Table 5: BERT MLM \(bert\-base\-uncased\) w/ Standard\-UFET Deshmukh et al\. \([2025](https://arxiv.org/html/2606.27598#bib.bib4)\)

Table 6: BERT MLM \(bert\-base\-uncased\) w/ Narrative\-UFET\-Change

Table 7: BERT MLM \(bert\-base\-uncased\) w/ Narrative\-UFET\-Maintain

Table 8: BERT MLM \(bert\-base\-uncased\) w/ OntoNotes Context

Table 9: Llama3\.3\-70B w/ Standard\-UFET

Table 10: Llama3\.3\-70B w/ Narrative\-UFET\-Change

Table 11: Llama3\.3\-70B w/ Narrative\-UFET\-Maintain

Table 12: Llama3\.3\-70B w/ OntoNotes Context

Table 13: Qwen3\-32B w/ Standard\-UFET

Table 14: Qwen3\-32B w/ Narrative\-UFET\-Change

Table 15: Qwen3\-32B w/ Narrative\-UFET\-Maintain

Table 16: Qwen3\-32B w/ OntoNotes Context