Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
Summary
This paper diagnoses the low diversity of LLM-generated stories, finding that 88.3% of sampled stories contain at least one of 11 common words (e.g., Elias, lighthouse) across models. It traces this homogeneity to a small share of post-training preference data amplified by alignment, rather than to prevalence in published literature or pre-training data.
Cached at: 05/27/26, 09:06 AM
# Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
Source: [https://arxiv.org/html/2605.26492](https://arxiv.org/html/2605.26492)
###### Abstract
Story writing is a popular use case for LLMs, but generated stories show very low variability\. We sample 20,000 total stories from four current models using five prompts\. We find that at least one of 11 words occurs in 88\.3% of generated stories, with little difference between models\. These words include names \(Elias, Mara, Elara\), settings \(lighthouses\), and professions \(clockmaker, librarian\)\. These tokens rarely occur in published literature or pre\-training data, but they are found in preference data that is likely to have been used by all current models\. Surprisingly, these “lighthouse” stories are infrequent when compared with the average post\-training story, much of which contains references to copyrighted characters or adult content\. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms\.
Sil Hamilton and David Mimno, Department of Information Science, Cornell University, \{srh255,mimno\}@cornell\.edu
## 1 Introduction
Figure 1: We prompt four models to write 20,000 stories\. 88\.3% of these stories contain at least one of 11 tokens at rates vastly higher than contemporary English literature \(here measured in parts per million\)\.
The output of large language models \(LLMs\), even across model families, is becoming increasingly homogeneous\. This mode\-collapse phenomenon is unusually clear in creative writing \(Hamilton, [2024](https://arxiv.org/html/2605.26492#bib.bib5)\)\. While we know from prompt data that story\-writing is a popular use case \(Zhao et al\., [2024](https://arxiv.org/html/2605.26492#bib.bib18)\), and that readers prefer interesting and surprising literature \(Moretti, [2000](https://arxiv.org/html/2605.26492#bib.bib4)\), the stories generated by LLMs are remarkable in their sameness\.
Prior work has proposed post\-hoc solutions like adjusting sampling techniques \(Troshin et al\., [2025](https://arxiv.org/html/2605.26492#bib.bib19)\) and new post\-training optimization objectives \(Chung et al\., [2025](https://arxiv.org/html/2605.26492#bib.bib6)\)\. In this short paper we characterize story mode collapse and explore publicly available training data to locate the source\.
We generate 20,000 stories with four current models from OpenAI, Anthropic, Google, and the Allen Institute for AI \(AI2\), finding that 88\.3% of generated stories contain at least one of 11 core words \(including character names, story locations, and professions\)\. Most notably, over half feature a lighthouse\. Why does this story pattern become favoured?
These 11 words are not common in published English literature, which suggests that post\-training data is responsible \([Figure 1](https://arxiv.org/html/2605.26492#S1.F1)\)\. But examining OLMo 3’s post\-training set reveals that only 3,053 out of a total 78,958 stories contain one of our 11 words\. \(Footnote 1: We release the IDs of these documents [here](https://github.com/srhm-ca/elias/)\.\) We find that the dominance of “Elias in the Lighthouse” stories cannot be explained by prevalence in pre\- or post\-training data\. We speculate that models are trained to avoid references to copyrighted characters and adult content during alignment, but we defer this question to future work\.
## 2 Related Work
Mode collapse, the tendency for a generative model to overfit on a small set of samples during training, was first observed in LLMs after post\-training techniques like SFT and RLHF were introduced \(Ouyang et al\., [2022](https://arxiv.org/html/2605.26492#bib.bib7); Bai et al\., [2022](https://arxiv.org/html/2605.26492#bib.bib8); Hamilton, [2024](https://arxiv.org/html/2605.26492#bib.bib5)\)\. Fears of mode collapse deepened as synthetic data became more common in training \(Gerstgrasser et al\., [2024](https://arxiv.org/html/2605.26492#bib.bib13); Shumailov et al\., [2024](https://arxiv.org/html/2605.26492#bib.bib15)\)\. Mode collapse has been demonstrated in linguistic markers \(Paech, [2025](https://arxiv.org/html/2605.26492#bib.bib9)\), query answers \(Zhang et al\., [2025](https://arxiv.org/html/2605.26492#bib.bib11); Lagzian et al\., [2025](https://arxiv.org/html/2605.26492#bib.bib12)\), and stories \(Doshi and Hauser, [2024](https://arxiv.org/html/2605.26492#bib.bib10)\)\. Prior work has proposed modifying prompts \(Lagzian et al\., [2025](https://arxiv.org/html/2605.26492#bib.bib12)\), sampling techniques \(Troshin et al\., [2025](https://arxiv.org/html/2605.26492#bib.bib19)\), and post\-training algorithms \(Kirk et al\., [2024](https://arxiv.org/html/2605.26492#bib.bib16)\), but the phenomenon remains poorly understood\. To our knowledge, this is the first study to examine a specific instance of mode collapse by tracing output back to training input\.
## 3 Characterizing Repetitive Story Traits
The lighthouse at the edge of the world did not guide ships; it signaled to the stars\.
Elias had been the keeper for forty years\. He was a man composed of salt, solitude, and the rhythmic ticking of gears\. The lighthouse was a towering spire of obsidian, carved directly into a jagged needle of rock that rose from a sea so still it looked like polished slate\.
Every night, Elias climbed the two hundred and twelve steps to the lantern room\. He didn’t use oil or electricity\. Instead, he tended to a sphere of captured nebula — a swirling, violet bruise of light that pulsed in time with his own heartbeat…
Figure 2: Written by Gemini 3\.1 Flash\-Lite when prompted to “write a story\.” Lighthouses are present in half of all 20,000 stories generated for this experiment\.
Current LLMs from OpenAI, Anthropic, Google, and AI2 write similar stories when prompted to “write a story” with no additional constraints\. \(Footnote 2: We find similar behavior for more complex prompts, but we focus on simple prompts for this initial study\.\) We use four current models: Claude Haiku 4\.5, Gemini 3\.1 Flash\-Lite, GPT\-5\.4\-Mini, and OLMo 7b Thinking\. \(Footnote 3: We selected smaller models to maximize sample size for a fixed budget; we observe that large and small models belonging to the same family share storywriting behavior\.\) We prompt each model with five requests \(“Write a story,” “Please write a story,” “Write me a story,” “Tell me a story,” and “Please tell a story”\) 1,000 times each, yielding 20,000 stories totalling 12\.8 million words\. \(Footnote 4: All models were accessed via OpenRouter for a total cost of US$180\. Endpoints available as of April 2026\.\)
A typical example shown in [Figure 2](https://arxiv.org/html/2605.26492#S3.F2) highlights three elements common across nearly all 20,000 stories: a location \(19,864 stories\), a character name \(19,864 stories\), and a profession \(15,807 stories\)\. In fact, the specific location \(“lighthouse”\), name \(“Elias”\), and profession \(“keeper”\) in this story appear in some combination across 66\.6% of all generated stories\. Light is likewise a common theme: 56% of stories generated by Claude are titled “The Lighthouse Keeper’s Secret” and the word “light” appears in 16,784 stories at an average rate of 3\.2 instances per story\.
Other common names include Mara and Elara; locations include lighthouses and villages; and professions include clockmakers, fishermen, and librarians\. Nearly all stories combine two or more of these three elements, suggesting models are sampling each from some common pool of candidates\. What other words does the pool contain?
To construct vocabulary lists useful for downstream analysis, we use GPT\-5\.4\-nano to identify token spans corresponding to story settings, characters’ first names, and their professions\. \(Footnote 5: All prompts are in [Appendix A](https://arxiv.org/html/2605.26492#A1)\.\) We verify the presence of each extracted span before filtering candidates in three steps: \(i\) we tokenize strings on whitespace, yielding multiple tokens per string; \(ii\) for each story and category we retain the extracted token with the highest corpus\-level frequency; and \(iii\) we retain all tokens emitted by at least half of all models\. Removing incoherent candidates yields 663 tokens: 247 locations, 71 names, and 345 professions across all stories\.
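The three filtering steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the record layout (`model`, `category`, `spans`), the function name `filter_candidates`, lowercasing before counting, and the alphabetical tie-break are all hypothetical choices made here for the example.

```python
from collections import Counter


def filter_candidates(extractions, min_model_share=0.5):
    """Apply steps (i)-(iii) to extracted spans.

    `extractions` is a list of dicts, one per (story, category), with the
    hypothetical keys 'model', 'category', and 'spans' (a list of strings).
    Returns the set of (category, token) pairs that survive filtering.
    """
    # (i) Tokenize each extracted string on whitespace (lowercasing is an assumption).
    tokenized = [
        (e["model"], e["category"], [t for s in e["spans"] for t in s.lower().split()])
        for e in extractions
    ]

    # Corpus-level frequency of every token within its category.
    freq = Counter((cat, tok) for _, cat, toks in tokenized for tok in toks)

    # (ii) For each story and category, keep only the most frequent token.
    models_by_token = {}
    all_models = set()
    for model, cat, toks in tokenized:
        all_models.add(model)
        if not toks:
            continue
        best = max(toks, key=lambda t: (freq[(cat, t)], t))  # ties broken alphabetically
        models_by_token.setdefault((cat, best), set()).add(model)

    # (iii) Keep tokens emitted by at least half of all models.
    threshold = min_model_share * len(all_models)
    return {key for key, models in models_by_token.items() if len(models) >= threshold}


example = [
    {"model": "a", "category": "profession", "spans": ["Lighthouse keeper"]},
    {"model": "b", "category": "profession", "spans": ["keeper"]},
    {"model": "c", "category": "profession", "spans": ["clockmaker"]},
    {"model": "d", "category": "name", "spans": ["Elias"]},
    {"model": "a", "category": "name", "spans": ["Elias"]},
]
kept = filter_candidates(example)
```

In this toy input, "keeper" and "elias" are each emitted by two of four models and survive, while "clockmaker" is emitted by only one model and is dropped. The manual removal of incoherent candidates described above is not modeled.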
Table 1: Counts for the most frequent words in our corpus, measured in parts per million words \(PPM\), versus representative samples of English literature \(LIT\), \(non\-\)fiction web data \(PRE\-NON/FIC\), and \(non\-\)fiction post\-training data \(POST\-NON/FIC\)\. Corpora containing the most Core tokens are in bold\.
Table 2: Core rates by alignment stage\.
Table 3: Core story prevalence by source\. Alignment datasets have 5–8x higher Core density than WildChat\-derived stories despite being 80% of all post\-training stories\.
Within the candidate vocabulary, we additionally select a Core vocabulary of 11 words using a changepoint analysis \(Killick et al\., [2012](https://arxiv.org/html/2605.26492#bib.bib2)\) to find a minimal set of candidate tokens most common across all stories\. 88\.3% of stories contain a Core token\. The Core includes names \(“Elias” is in 26\.5% of all stories, “Mara” in 16\.7%, and “Elara” in 13\.1%\), professions \(“keeper” at 48\.1% of all stories, while “clockmaker,” “baker,” “fisherman,” “librarian,” “mayor,” and “conductor” each occur in 1\.9% to 6\.6% of stories\), and a single location: “lighthouse” with a frequency of 51\.2%\.
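The two measures used throughout, a token's rate in parts per million words and the share of stories containing at least one Core token, can be computed as in the sketch below. The regex tokenizer and the function names are assumptions made for illustration; in particular, exact lowercase matching means a plural such as "lighthouses" would not count as "lighthouse", which may differ from how the reported figures were produced.

```python
import re


def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def ppm(token, stories):
    """Occurrences of `token` per million words across `stories`."""
    total_words = 0
    hits = 0
    for story in stories:
        tokens = tokenize(story)
        total_words += len(tokens)
        hits += tokens.count(token)
    return 1_000_000 * hits / total_words if total_words else 0.0


def coverage(core_tokens, stories):
    """Fraction of stories that contain at least one token from `core_tokens`."""
    core = set(core_tokens)
    covered = sum(1 for story in stories if core & set(tokenize(story)))
    return covered / len(stories) if stories else 0.0


toy_stories = [
    "Elias kept the lighthouse.",
    "Mara baked bread.",
    "A dog ran home.",
]
elias_ppm = ppm("elias", toy_stories)            # 1 hit in 11 words
share = coverage({"elias", "lighthouse", "keeper"}, toy_stories)
```

Dividing a token's PPM in generated stories by its PPM in a reference corpus gives the kind of ratio used to compare generated stories with published fiction.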
The Core words and a second tier of 50 additional words are given in [Appendix B](https://arxiv.org/html/2605.26492#A2), while Core PPM rates per model are shown in [Table 5](https://arxiv.org/html/2605.26492#A3.T5)\. 98% of stories contain at least one of these 61 words, while 49\.1% contain a full name\-profession\-location triple\. Words \(especially names\) vary by model as shown in [Appendix C](https://arxiv.org/html/2605.26492#A3), but nearly all terms are used by all models\. Professions suggest an idyllic, pre\-modern setting: clockmaker, blacksmith, innkeeper, keeper, baker, fisherman\. Other tokens describe curation \(restorer, collector, caretaker\)\.
## 4 Tracing Story Traits To Training Data
Figure 3: t\-SNE projection of a topic model over all stories in OLMo 3’s post\-training set \(left\), and Core stories \(right\)\. Stories highlighted in red are “lighthouse” stories, spread across many topics, including toilet humor and fan fiction\.
Frequent Core vocabulary in LLM story generation cannot be explained by the frequency of those words in published English fiction, pre\-training data, or post\-training data\. We assess each potential source by comparing Core rates in our corpus with English corpora\. We give rates in [Table 1](https://arxiv.org/html/2605.26492#S3.T3)\.
The simplest explanation would be that Core tokens are common in English literature\. We consulted CONLIT, a corpus containing 2,700 contemporary English novels published between 2007 and 2021 across 12 genres of fiction with $\approx 287$ million total words \(Piper, [2022](https://arxiv.org/html/2605.26492#bib.bib3)\)\. The frequency of Core tokens is far greater in generated stories than published fiction, e\.g\. “Elias” is 900 times more frequent in our corpus\. To assess amateur fiction we consulted stories on the subreddit /r/writingprompts \(Huang et al\., [2024](https://arxiv.org/html/2605.26492#bib.bib21)\)\. Rates are similar to CONLIT \([Appendix D](https://arxiv.org/html/2605.26492#A4)\), suggesting models do not track human storywriting patterns\.
To assess whether these tokens are common in English web data, we turn to OLMo 3, whose training data includes Common Crawl and is freely available\. OLMo 3 was trained on $\approx 3.89$ billion predominantly human\-written documents during pre\-training, of which 33 million are marked as Literature\. Across these documents we find near\-negligible Core PPM rates \(e\.g\. “Elara” appears 0\.7 times per million words\)\. To ensure that we are looking at web stories and not non\-fiction literature, we train a fiction classifier with 200k balanced samples from OLMo’s pre\-training corpus annotated for narrativity with GPT\-OSS 20b with the following prompt inspired by Piper and Bagga \([2025](https://arxiv.org/html/2605.26492#bib.bib1)\): “Is this passage a work of fiction? Answer only with a number: 1 if yes, 0 if no\.” We then train a FastText classifier and evaluate on 400 balanced samples of fiction and non\-fiction in CONLIT, achieving an $F_1$ of 0\.84 \(precision = 0\.75, recall = 0\.98\)\. Filtering by this classification shows a slight \($\approx 2\times$\) increase in some Core words in the fiction portion, but nowhere near the rate in our generated stories\.
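For readers who want to relate the three classifier metrics, the sketch below computes F1 as the harmonic mean of precision and recall. The function name is chosen here for illustration. Because only rounded precision and recall are reported, the function returns roughly 0.85 for those inputs, which is within rounding distance of the reported 0.84.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the text; the unrounded values are not given.
approx_f1 = f1_score(0.75, 0.98)
```

The high recall and lower precision indicate a classifier that rarely misses fiction but sometimes labels non-fiction as fiction, so the filtered fiction portion likely still contains some non-fiction documents.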
If Core words are not common in web data, then one remaining source would be post\-training data\. But we find that OLMo’s post\-training data exhibits our tokens at a lower rate than CONLIT\. Using the same fiction classifier, we find that 78,958 stories in post\-training data show the highest concentration of Core words of any of the training or literature subsets, but even then dramatically less than generated stories: “Elias” occurs 52\.7 PPM in OLMo 3 stories vs\. 2\.7 in CONLIT, but 2,428 in our corpus\.
### Which dataset\(s\) are contributing Core tokens?
This suggests OLMo 3 learns to write Core stories from relatively few samples\. To understand which datasets are contributing these stories, we assign a binary score to each story indicating the presence of one or more Core tokens\. \(Footnote 6: For documents containing accepted/rejected pairs, we only consider the accepted sample\.\) We expected the majority of Core stories to appear in SFT data because WildChat \(and its derivatives\) is the most story\-dominant source for OLMo at 59,266 total stories \(Zhao et al\., [2024](https://arxiv.org/html/2605.26492#bib.bib18)\)\. But only 1,803 of these stories contain Core tokens, and measuring the Core rate by alignment stage \(e\.g\. SFT, DPO, and RL\) shows DPO and RL contribute relatively more Core stories than SFT \([Table 2](https://arxiv.org/html/2605.26492#S3.T3) and [Table 3](https://arxiv.org/html/2605.26492#S3.T3)\)\. We find OLMo 3 learns Core vocabulary from 3,053 examples, or 3\.8% of all stories observed during post\-training\.
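The binary scoring and per-stage aggregation can be sketched as follows. The 11 Core tokens are taken from Section 3; the record fields (`stage`, `text`, `chosen`, `rejected`), the function names, and the simple regex tokenizer are hypothetical and stand in for whatever schema the actual post-training datasets use.

```python
import re
from collections import defaultdict

# The 11 Core tokens listed in Section 3.
CORE = {
    "elias", "mara", "elara", "keeper", "clockmaker", "baker",
    "fisherman", "librarian", "mayor", "conductor", "lighthouse",
}


def is_core_story(text, core=CORE):
    """Binary score: 1 if the text contains at least one Core token, else 0."""
    return int(bool(core & set(re.findall(r"[a-z]+", text.lower()))))


def core_rate_by_stage(records):
    """Share of Core stories per alignment stage.

    Each record is a dict with a 'stage' key and either a 'text' key or,
    for preference pairs, a 'chosen' key (only the accepted sample is scored).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        text = record.get("chosen") or record.get("text", "")
        totals[record["stage"]] += 1
        hits[record["stage"]] += is_core_story(text)
    return {stage: hits[stage] / totals[stage] for stage in totals}


toy_records = [
    {"stage": "SFT", "text": "A robot went shopping."},
    {"stage": "SFT", "text": "Elias lit the lamp."},
    {"stage": "DPO", "chosen": "The lighthouse stood.", "rejected": "Batman arrived."},
]
rates = core_rate_by_stage(toy_records)
```

Grouping by a dataset-name field instead of `stage` would give the per-source view rather than the per-stage view.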
## 5 Post\-Training Story Genres
To better understand what kinds of stories OLMo 3 encounters during post\-training, we trained a 10\-topic LDA topic model \(Blei et al\., [2003](https://arxiv.org/html/2605.26492#bib.bib20)\) on the full post\-training story corpus \([Figure 3](https://arxiv.org/html/2605.26492#S4.F3)\)\. We find a diverse range of content, with dominant topics including fan fiction for popular Japanese media, video games, and American cartoons\. As expected given their relative frequency, “lighthouse” stories do not form a single topic, and are instead spread out across our discovered topics\. They are particularly concentrated in clusters containing generic fiction, but they nonetheless fail to dominate any topic\. A close reading reveals that several topics frequently feature stories containing inappropriate humour and adult content, which is surprising considering OLMo 3 will not typically emit inappropriate content when writing\. Future work should investigate whether these stories fail to trigger the safety and quality filters used for data cleaning, and if so, why\.
## 6 Conclusion
When given little direction, current frontier models write stories using a narrow catalog of names, places, and occupations\. Recurring characters in these stories include Elias, a lighthouse keeper\. Elias is unusual; the name is uncommon in literature, web data, and even post\-training data\. We have found that of the 78,958 stories exposed to OLMo 3 during post\-training, only $\approx 3{,}053$ stories contain one or more of these 11 unusual tokens\. But despite constituting only 3\.8% of post\-training stories, and $7.71 \times 10^{-7}$ of the $\approx 4$ billion total documents OLMo 3 was trained on, these “lighthouse” stories hold a disproportionately large influence over what stories the model writes in practice\. This suggests models do not simply mimic the dominant patterns in their training corpora\. Future work will want to determine whether alignment causes models to prefer the “safest” \(for work\) samples in post\-training, avoiding the potentially unsafe topic matter of many stories they otherwise encounter\.
## Limitations
Our experiment is expressly monolingual to prevent confounders stemming from unanticipated multilingual behaviour, but it would be valuable for future work to explore how multilingual storywriting prompts impact the phenomena observed\.
## References
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, N\. Joseph, S\. Kadavath, J\. Kernion, T\. Conerly, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, T\. Hume, S\. Johnston, S\. Kravec, L\. Lovitt, N\. Nanda, C\. Olsson, D\. Amodei, T\. Brown, J\. Clark, S\. McCandlish, C\. Olah, B\. Mann, and J\. Kaplan \(2022\) Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback\. arXiv\. External Links: 2204\.05862, [Document](https://dx.doi.org/10.48550/arXiv.2204.05862)\.
- D\. M\. Blei, A\. Y\. Ng, and M\. I\. Jordan \(2003\) Latent Dirichlet allocation\. Journal of Machine Learning Research 3 \(Jan\), pp\. 993–1022\.
- J\. J\. Y\. Chung, V\. Padmakumar, M\. Roemmele, Y\. Sun, and M\. Kreminski \(2025\) Modifying Large Language Model Post\-Training for Diverse Creative Writing\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2503.17126)\.
- A\. R\. Doshi and O\. P\. Hauser \(2024\) Generative AI enhances individual creativity but reduces the collective diversity of novel content\. Science Advances 10 \(28\), pp\. eadn5290\. External Links: ISSN 2375\-2548, [Document](https://dx.doi.org/10.1126/sciadv.adn5290)\.
- M\. Gerstgrasser, R\. Schaeffer, A\. Dey, R\. Rafailov, H\. Sleight, J\. Hughes, T\. Korbak, R\. Agrawal, D\. Pai, A\. Gromov, D\. A\. Roberts, D\. Yang, D\. L\. Donoho, and S\. Koyejo \(2024\) Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data\. arXiv\. External Links: 2404\.01413\.
- S\. Hamilton \(2024\) Detecting mode collapse in language models via narration\. In Proceedings of the First edition of the Workshop on the Scaling Behavior of Large Language Models \(SCALE\-LLM 2024\), pp\. 65–72\.
- X\. Y\. Huang, K\. Vishnubhotla, and F\. Rudzicz \(2024\) The GPT\-WritingPrompts dataset: a comparative analysis of character portrayal in short stories\. arXiv preprint arXiv:2406\.16767\.
- R\. Killick, P\. Fearnhead, and I\. A\. Eckley \(2012\) Optimal detection of changepoints with a linear computational cost\. Journal of the American Statistical Association 107 \(500\), pp\. 1590–1598\.
- R\. Kirk, I\. Mediratta, C\. Nalmpantis, J\. Luketina, E\. Hambro, E\. Grefenstette, and R\. Raileanu \(2024\) Understanding the Effects of RLHF on LLM Generalisation and Diversity\. In ICLR 2024\. External Links: 2310\.06452, [Document](https://dx.doi.org/10.48550/arXiv.2310.06452)\.
- A\. Lagzian, S\. Anumasa, and D\. Liu \(2025\) Multi\-Novelty: Improve the Diversity and Novelty of Contents Generated by Large Language Models via inference\-time Multi\-Views Brainstorming\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2502.12700)\.
- F\. Moretti \(2000\) The Slaughterhouse of Literature\. Modern Language Quarterly 61 \(1\), pp\. 207–228\. External Links: ISSN 0026\-7929, 1527\-1943, [Document](https://dx.doi.org/10.1215/00267929-61-1-207)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\) Training language models to follow instructions with human feedback\. arXiv\. External Links: 2203\.02155, [Document](https://dx.doi.org/10.48550/arXiv.2203.02155)\.
- S\. Paech \(2025\) Sam\-paech/slop\-forensics\.
- A\. Piper and S\. Bagga \(2025\) NarraDetect: an annotated dataset for the task of narrative detection\. In Proceedings of the 7th Workshop on Narrative Understanding, pp\. 1–7\.
- A\. Piper \(2022\) The CONLIT dataset of contemporary literature\. Journal of Open Humanities Data 8\.
- I\. Shumailov, Z\. Shumaylov, Y\. Zhao, N\. Papernot, R\. Anderson, and Y\. Gal \(2024\) AI models collapse when trained on recursively generated data\. Nature 631 \(8022\), pp\. 755–759\. External Links: ISSN 1476\-4687, [Document](https://dx.doi.org/10.1038/s41586-024-07566-y)\.
- S\. Troshin, W\. Mohammed, Y\. Meng, C\. Monz, A\. Fokkens, and V\. Niculae \(2025\) Control the Temperature: Selective Sampling for Diverse and High\-Quality LLM Outputs\. arXiv\. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2510.01218)\.
- Y\. Zhang, H\. Diddee, S\. Holm, H\. Liu, X\. Liu, V\. Samuel, B\. Wang, and D\. Ippolito \(2025\) NoveltyBench: Evaluating Language Models for Humanlike Diversity\. arXiv\. External Links: 2504\.05228, [Document](https://dx.doi.org/10.48550/arXiv.2504.05228)\.
- W\. Zhao, X\. Ren, J\. Hessel, C\. Cardie, Y\. Choi, and Y\. Deng \(2024\) WildChat: 1M ChatGPT Interaction Logs in the Wild\. arXiv\. External Links: 2405\.01470, [Document](https://dx.doi.org/10.48550/arXiv.2405.01470)\.
## Appendix A Prompts
In this section we provide all prompts used over the course of the experiment\.
### Prompts for story generation\.
The five prompts used to generate stories are as follows\.
Write a story\.
Please write a story\.
Write me a story\.
Tell me a story\.
Please tell a story\.
### Name, location, profession extraction\.
The system prompt for extracting names, locations, and professions is as follows\.
You extract structured metadata from short stories\.
Return JSON only, with this exact schema:
\{"character\_names": \["first name only", \.\.\.\], "settings": \["place or location noun phrase", \.\.\.\], "professions": \["profession or role noun", \.\.\.\]\}
Rules:
- Output valid JSON and nothing else\.
- \`character\_names\` must contain only first names for named human or human\-like characters\.
- Exclude surnames, titles, pronouns, groups, and unnamed roles\.
- \`settings\` should be concise setting/location phrases from the story, e\.g\. "Lighthouse", "Village square", "Oakhaven"\.
- \`professions\` should be concise occupations or stable roles, e\.g\. "Clockmaker", "Lighthouse keeper", "Baker"\.
- Deduplicate while preserving first appearance order\.
- If a field has no items, return an empty list\.
- Every returned string must be an exact token span copied from the story text\.
The user prompt for extracting names, locations, and professions is as follows\.
Read the story and answer these questions:
1\. Who are the characters in the text?
2\. What is the character’s role in the text?
3\. What is the setting?
Return JSON only in this exact schema:
\{\{"character\_names": \["first name only", \.\.\.\], "settings": \["place or location noun phrase", \.\.\.\], "professions": \["profession or stable role noun phrase", \.\.\.\]\}\}
Additional rules:
- For \`character\_names\`, include only first names for named human or human\-like characters\.
- For \`professions\`, map each character’s role in the text to concise occupations or stable roles when present\.
- For \`settings\`, list the main setting locations or place names\.
- Every returned item must be an exact contiguous span copied from the story text\.
- Do not normalize, paraphrase, singularize, pluralize, or invent labels\.
- If the exact answer is not present as a span in the story, omit it\.
- Deduplicate items while preserving first appearance order\.
- Use empty lists when something is absent\.
Story:
\{story\}
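Both prompts require every returned item to be an exact span of the story, and Section 3 notes that the presence of each extracted span is verified before filtering. A minimal sketch of such a check is given below; the function name `parse_and_verify` and the choice to treat malformed output as empty are assumptions made for illustration, not a description of the authors' pipeline.

```python
import json

FIELDS = ("character_names", "settings", "professions")


def parse_and_verify(raw_json, story):
    """Parse the extractor's JSON reply and keep only spans found verbatim in `story`.

    Malformed replies yield empty lists for every field (an assumed policy).
    Duplicates are dropped while preserving first-appearance order.
    """
    empty = {field: [] for field in FIELDS}
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty

    verified = {}
    for field in FIELDS:
        items = data.get(field, [])
        if not isinstance(items, list):
            items = []
        kept = []
        for item in items:
            # Case-sensitive substring test: the span must be copied exactly.
            if isinstance(item, str) and item and item in story and item not in kept:
                kept.append(item)
        verified[field] = kept
    return verified


story = "Elias was the keeper of the lighthouse."
reply = (
    '{"character_names": ["Elias", "Mara"], '
    '"settings": ["lighthouse"], '
    '"professions": ["keeper", "keeper"]}'
)
checked = parse_and_verify(reply, story)
```

Here "Mara" is discarded because it never appears in the story, and the repeated "keeper" is collapsed to a single entry.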
The prompt for identifying fictionality is as follows\.
“Is this passage a work of fiction? Answer only with a number: 1 if yes, 0 if no\.”
## Appendix B Common Tokens
In [Table 4](https://arxiv.org/html/2605.26492#A2.T4) we present all 61 Core and additional tokens common to LLM\-generated short stories that we identified in our experiment\.
Names
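Professions

| Segment | Token | Count | Token | Count |
| --- | --- | --- | --- | --- |
| Core | keeper | 9,609 | fisherman | 673 |
| Core | baker | 1,325 | librarian | 592 |
| Core | mayor | 975 | conductor | 389 |
| Core | clockmaker | 958 | | |
| Additional | guard | 1,454 | apothecary | 100 |
| Additional | captain | 1,417 | curator | 94 |
| Additional | guardian | 1,037 | mapmaker | 93 |
| Additional | sailor | 514 | watchmaker | 86 |
| Additional | jeweler | 383 | weaver | 84 |
| Additional | priest | 352 | historian | 80 |
| Additional | owner | 287 | innkeeper | 80 |
| Additional | collector | 281 | scholar | 80 |
| Additional | blacksmith | 247 | clerk | 73 |
| Additional | miller | 244 | hermit | 64 |
| Additional | elder | 201 | farmer | 64 |
| Additional | doctor | 198 | custodian | 61 |
| Additional | restorer | 193 | engineer | 47 |
| Additional | cartographer | 155 | lawyer | 45 |
| Additional | archivist | 151 | stationmaster | 44 |
| Additional | caretaker | 143 | scientist | 35 |
| Additional | apprentice | 126 | mender | 32 |
| Additional | shopkeeper | 123 | healer | 27 |
| Additional | teacher | 122 | scavenger | 21 |
| Additional | postman | 101 | proprietor | 17 |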
Locations
Table 4: Tokens in the first two changepoint segments of each category’s coverage curve, with token hit counts across the generated story corpus\. Within each segment tokens are listed in descending hit\-count order\.
## Appendix C Per\-Model Core PPM
We present Core token concentrations across all stories generated by each model in [Table 5](https://arxiv.org/html/2605.26492#A3.T5)\.
Table 5: PPM for each Core token across all models\. Bolded values indicate each token’s most frequent model\.
## Appendix D Core Token Frequencies on Reddit
We calculate Core token frequencies on non\-published human\-written literature from Reddit in [Table 6](https://arxiv.org/html/2605.26492#A4.T6)\.
Table 6: Core token frequencies in the human\-written WritingPrompts corpus, which contains 272,600 stories\.