Characterizing Narrative Content in Web-scale LLM Pretraining Data
Summary
A fine-grained study of narrative features in web-scale LLM pretraining data, introducing NarraBERT and NarraDolma to measure narrative patterns and their distribution across sources.
Cached at: 06/22/26, 09:33 PM
Source: https://huggingface.co/papers/2606.19468
Abstract
A comprehensive analysis of narrative structures in large-scale language model training data reveals measurable, multidimensional narrative patterns that vary across different content sources and topics.
The narrative composition of web-scale LLM pretraining corpora remains largely unexplored, even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
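Finding (iii) is a statement about how per-passage narrative scores differ once they are grouped by source. The sketch below shows one way such a comparison could be computed. It is not the authors' analysis code: the record layout, the source names, and the score values are all hypothetical, and it uses the three narrative elements named in the abstract instead of the 11 dimensions, which the abstract does not list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-passage records: a source label plus a score in [0, 1]
# for each of the three narrative elements named in the abstract.
# Field names, source names, and values are made up for illustration and
# do not reflect the actual NarraDolma schema.
ELEMENTS = ("agency", "setting", "events")

passages = [
    {"source": "forum", "agency": 0.8, "setting": 0.4, "events": 0.7},
    {"source": "forum", "agency": 0.6, "setting": 0.2, "events": 0.5},
    {"source": "reference", "agency": 0.1, "setting": 0.3, "events": 0.2},
    {"source": "reference", "agency": 0.3, "setting": 0.1, "events": 0.2},
]


def mean_scores_by_source(records, elements=ELEMENTS):
    """Return {source: {element: mean score}} for the given records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["source"]].append(record)
    return {
        source: {e: mean(r[e] for r in rows) for e in elements}
        for source, rows in grouped.items()
    }


summary = mean_scores_by_source(passages)
for source, scores in sorted(summary.items()):
    print(source, {e: round(s, 2) for e, s in scores.items()})
```

With real model predictions in place of the toy records, the same grouping could be repeated per topic or per dimension to see where narrative content concentrates.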
Models citing this paper: 2
#### teagrjohnson/narrative-event-relation-roberta (Text Classification)
#### teagrjohnson/narrative-likert-roberta (Text Classification)
Datasets citing this paper: 3
#### teagrjohnson/narrative-llm-annotations
#### teagrjohnson/narrative-gold-annotations
#### teagrjohnson/narradolma
Similar Articles
BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories
Researchers introduce BIASEDTALES-ML, a large-scale multilingual dataset of ~350,000 LLM-generated children's stories across eight languages, designed to analyze narrative attribute distributions and cross-lingual bias patterns in language model outputs. The work reveals significant cross-lingual variability, highlighting limitations of English-centric bias evaluations.
Narrative Landscape: Mapping Narrative Dispositions Across LLMs
This paper introduces a quantitative framework and visualization tool called 'Narrative Landscape' to map and compare the narrative dispositions and stability of frontier LLMs.
LLMs for automatic annotation of Mandarin narrative transcripts
This paper evaluates LLMs for automatically annotating narrative macrostructure in spoken Mandarin, finding that the best model achieves near-human reliability while reducing annotation time by 65%, though performance degrades on semantically complex or lexically diverse narratives.
NARRA-Gym for Evaluating Interactive Narrative Agents
This paper introduces NARRA-Gym, a benchmark and executable evaluation environment for assessing Large Language Models' abilities in sustaining interactive narratives, managing memory, and adapting to users over multiple turns.
Do Large Language Models Always Tell The Same Stories?
This paper investigates whether large language models generate diverse stories. Using narrative similarity analysis, the authors find that LLM-generated narratives are consistently more similar to each other than human-written stories, and that common mitigation strategies like negative prompting and temperature scaling fail to address this homogeneity.