Characterizing Narrative Content in Web-scale LLM Pretraining Data

Hugging Face Daily Papers 06/17/26, 12:00 AM Papers

Summary

A fine-grained study of narrative features in web-scale LLM pretraining data, introducing NarraBERT and NarraDolma to measure narrative patterns and their distribution across sources.

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
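To make finding (iii) concrete, the following is a minimal sketch of how per-passage narrative scores could be averaged by source to compare distributions. It is not the authors' code: the record layout, the source names, the numeric scale, and the function `mean_scores_by_source` are all assumptions for illustration, and the toy values are made up. Only the three element names (agency, setting, events) come from the abstract; the 11 underlying dimensions are not listed here, so they are not modeled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records. Source names and score values are invented;
# only the three element names are taken from the abstract.
PASSAGES = [
    {"source": "forum", "agency": 4.0, "setting": 2.0, "events": 3.0},
    {"source": "forum", "agency": 2.0, "setting": 4.0, "events": 5.0},
    {"source": "reference", "agency": 1.0, "setting": 1.0, "events": 2.0},
    {"source": "reference", "agency": 1.0, "setting": 3.0, "events": 2.0},
]
ELEMENTS = ("agency", "setting", "events")


def mean_scores_by_source(passages, elements=ELEMENTS):
    """Return {source: {element: mean score}} for the given passages."""
    grouped = defaultdict(list)
    for passage in passages:
        grouped[passage["source"]].append(passage)
    return {
        source: {e: mean(row[e] for row in rows) for e in elements}
        for source, rows in grouped.items()
    }


profiles = mean_scores_by_source(PASSAGES)
for source, profile in sorted(profiles.items()):
    print(source, profile)
```

With real predictions in place of the toy records, differences between the per-source profiles would indicate the kind of unequal distribution the abstract describes.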

Original Article

Cached at: 06/22/26, 09:33 PM

Paper page - Characterizing Narrative Content in Web-scale LLM Pretraining Data

Source: https://huggingface.co/papers/2606.19468

Abstract

A comprehensive analysis of narrative structures in large-scale language model training data reveals measurable, multidimensional narrative patterns that vary across different content sources and topics.

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.

Models citing this paper (2)

#### teagrjohnson/narrative-event-relation-roberta
Text Classification • Updated 3 days ago • 77 • 2

#### teagrjohnson/narrative-likert-roberta
Text Classification • Updated 3 days ago • 94 • 1

Datasets citing this paper (3)

#### teagrjohnson/narrative-llm-annotations
Updated 3 days ago • 32 • 1

#### teagrjohnson/narrative-gold-annotations
Updated 3 days ago • 26 • 1

#### teagrjohnson/narradolma
Updated 3 days ago • 24 • 1

Spaces citing this paper (1)

Collections including this paper (1)

Similar Articles

BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories

Narrative Landscape: Mapping Narrative Dispositions Across LLMs

LLMs for automatic annotation of Mandarin narrative transcripts

NARRA-Gym for Evaluating Interactive Narrative Agents

Do Large Language Models Always Tell The Same Stories?
