NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama


Summary

This paper introduces NarrativeWorldBench, a benchmark for evaluating long-horizon narrative consistency in audio dramas, and N-VSSM, a latent state-space model that outperforms frontier LLMs across multiple horizons and languages.

arXiv:2606.17391v1 Announce Type: new Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.

# A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama
Source: [https://arxiv.org/html/2606.17391](https://arxiv.org/html/2606.17391)
Logan Mann¹, Abdur Rahman², Mohammad Saifullah², Taaha Kazi², Vasu Sharma²

¹University of California, Santa Barbara; ²Pocket FM

###### Abstract

Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band $[0.78, 0.81]$ and collapse by about $-0.20$ F1 at horizon $h=200$. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons $h \in \{10, 20, 50, 100, 200\}$, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 $\geq 0.84$ across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by $+0.20$ to $+0.23$ Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated $+1.3$ Likert points higher on controllability.

## 1 Introduction

Audio dramas, fictional podcasts, and immersive audio series are a fast\-growing creative form\. Individual arcs commonly span 200 to 800 episodes\. Global production now exceeds 60,000 active series with an estimated 2 billion monthly listeners\. The defining computational problem in this medium is not one\-shot quality but horizon: a system must keep a story coherent over hundreds of episodes while a human collaborator steers it\.

Existing long-context benchmarks do not measure this. LongBench \[[2](https://arxiv.org/html/2606.17391#bib.bib2)\], RULER \[[8](https://arxiv.org/html/2606.17391#bib.bib8)\], NoCha \[[11](https://arxiv.org/html/2606.17391#bib.bib11)\], FActScore \[[12](https://arxiv.org/html/2606.17391#bib.bib12)\], and L-Eval \[[1](https://arxiv.org/html/2606.17391#bib.bib1)\] evaluate retrieval, factual recall, and summarization. None of them measures structural narrative consistency under co-creative continuation, where the model must extend an ongoing arc that a writer is actively shaping.

We make three contributions\.

1. NarrativeWorldBench, a benchmark of nine narrative-structure metrics across five horizons up to $h=200$, in English plus four Indic languages.
2. A frontier audit of 21 LMs that reveals a saturation ceiling in the band $[0.78, 0.81]$ and a uniform $-0.20$ F1 collapse at $h=200$.
3. N-VSSM and a learned Cultural Transfer Function: a latent world model that crosses the ceiling and a representational transform that recovers cross-language fidelity.

## 2 Related Work

#### Long\-context benchmarks\.

LongBench \[[2](https://arxiv.org/html/2606.17391#bib.bib2)\], RULER \[[8](https://arxiv.org/html/2606.17391#bib.bib8)\], L-Eval \[[1](https://arxiv.org/html/2606.17391#bib.bib1)\], ∞Bench \[[18](https://arxiv.org/html/2606.17391#bib.bib18)\], and NoCha \[[11](https://arxiv.org/html/2606.17391#bib.bib11)\] stress retrieval and recall over long inputs. They are complementary to our setting but do not score narrative structure.

#### Story and screenplay generation\.

Re3 \[[16](https://arxiv.org/html/2606.17391#bib.bib16)\], DOC \[[17](https://arxiv.org/html/2606.17391#bib.bib17)\], Dramatron \[[13](https://arxiv.org/html/2606.17391#bib.bib13)\], WritingBench \[[15](https://arxiv.org/html/2606.17391#bib.bib15)\], and story-generation-as-search \[[4](https://arxiv.org/html/2606.17391#bib.bib4)\] target plan-and-write or search-based long-form generation. The closest prior work is learned planners \[[14](https://arxiv.org/html/2606.17391#bib.bib14)\] and structured-memory transformers \[[9](https://arxiv.org/html/2606.17391#bib.bib9)\], both of which add explicit structure on top of a base generator.

#### State\-space models\.

S4 \[[7](https://arxiv.org/html/2606.17391#bib.bib7)\], Mamba \[[6](https://arxiv.org/html/2606.17391#bib.bib6)\], and Mamba-2 \[[5](https://arxiv.org/html/2606.17391#bib.bib5)\] provide linear-time sequence backbones that scale to long contexts. N-VSSM uses Mamba-2 as its decoder.

#### Cross\-cultural NLG\.

Underspecification in localization \[[10](https://arxiv.org/html/2606.17391#bib.bib10)\] and cultural alignment in LLMs \[[3](https://arxiv.org/html/2606.17391#bib.bib3)\] motivate our cross-lingual protocol and the Cultural Transfer Function.

## 3 NarrativeWorldBench

### 3.1 Source corpus

The benchmark draws on 1,204 serialized audio drama continuation instances from 38 series, all released under open redistribution licenses \(CC\-BY 4\.0 or CC\-BY\-SA 4\.0\)\. The corpus is genre\-balanced across drama, thriller, fantasy, science fiction, slice\-of\-life, and mystery\. Every series has an arc length of at least 80 episodes\. The average episode length is 4,820 words\. Arcs span 80 to 412 episodes, with a median of 178\.

### 3.2 Evaluation horizons and protocol

We evaluate at five horizons $h \in \{10, 20, 50, 100, 200\}$. For each instance, the model is conditioned on episodes $1 \dots k$ together with a structured scene plan for episode $k+h$, and it must produce episode $k+h$. The intervening episodes $k+1 \dots k+h-1$ are held out: the model receives only the structured scene plan, never the held-out episodes. This isolates the model’s ability to carry narrative state forward rather than to copy or retrieve.
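The conditioning protocol can be sketched in a few lines of Python. This is a minimal illustration, not the released harness: the `Instance` container, the `build_instance` helper, and the plain-string representation of episodes and scene plans are all assumptions made for the example.

```python
from dataclasses import dataclass

HORIZONS = (10, 20, 50, 100, 200)  # horizons named in the protocol


@dataclass
class Instance:
    """Hypothetical container for one continuation instance."""
    context: list[str]   # episodes 1..k, visible to the model
    scene_plan: str      # structured plan for episode k+h, visible to the model
    reference: str       # gold episode k+h, used only for scoring
    k: int
    h: int


def build_instance(episodes, scene_plans, k, h):
    """Build one instance; episode i is stored at list index i-1."""
    if h not in HORIZONS:
        raise ValueError(f"unsupported horizon: {h}")
    target = k + h
    if k < 1 or target > len(episodes):
        raise ValueError("arc is too short for this (k, h) pair")
    # Episodes k+1 .. k+h-1 are held out: they never enter the context.
    return Instance(
        context=episodes[:k],
        scene_plan=scene_plans[target - 1],
        reference=episodes[target - 1],
        k=k,
        h=h,
    )
```

For an arc of 60 episodes, `build_instance(episodes, plans, k=10, h=50)` exposes episodes 1 to 10 plus the plan for episode 60, and keeps episodes 11 to 59 out of the model input.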

### 3.3 Metrics

NarrativeWorldBench reports nine metrics, all automatable and reproducible.

- Plot-Beat F1 (primary): F1 over a 14-class Save-the-Cat plot-beat taxonomy, extracted by a held-out judge ensemble.
- Character-Voice Consistency: cosine distance between per-character style-embedding centroids.
- World-Rule Violation Rate: violations of per-series rules drawn from a series bible.
- Foreshadowing Payoff Rate: fraction of introduced foreshadowing that is paid off within $h$ episodes.
- Temporal Coherence: ordering-violation rate over extracted event chains.
- Thematic Recurrence: KL divergence between motif distributions in held-out and generated arcs.
- Emotion-Arc Alignment: dynamic time warping (DTW) over per-scene valence and arousal traces.
- Dialogue-Attribution Accuracy: speaker-identification F1.
- Motif Persistence: overlap of per-motif lifespan distributions.
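Two of the metrics above rest on standard primitives that are easy to show concretely. The sketch below gives a plain cosine distance (as used for Character-Voice Consistency) and a textbook dynamic-time-warping distance (as used for Emotion-Arc Alignment). It assumes one-dimensional traces and an absolute-difference local cost; the benchmark's reference implementations may normalize or combine valence and arousal differently.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with |x - y| as the local cost."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

DTW tolerates differences in pacing: a generated arc that holds one emotional beat a scene longer than the reference still aligns at zero cost, which a pointwise distance would penalize.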

### 3.4 Cross-cultural localization

We translate the held\-out prompts into Hindi, Tamil, Telugu, and Marathi using three professional translators per language with back\-translation review\. Three native\-speaker raters per episode then score cultural fidelity on a calibrated 7\-point Likert scale covering idiom, social context, and register\.

## 4 Frontier Audit

### 4.1 Systems

We evaluate 21 systems in five tiers.

- Classical: GPT-3.5-Turbo, Llama-2-70B.
- Fine-tuned narrative baselines: DOC, Dramatron, and Re3, re-implemented on Llama-3-70B.
- Open-frontier: Llama-3.1-405B, DeepSeek-V3, Qwen-2.5-72B, Mixtral-8x22B, and additional open systems.
- Closed-frontier: Claude Opus 4.5, GPT-5, Gemini-2.5-Pro, Grok-4-Heavy.
- Reasoning-tier: o3-Pro, Claude Opus 4.5 (thinking), Gemini-2.5-Pro (deep-think), DeepSeek-R1.

All systems use temperature 0\.7 and top\-p 0\.95\. We report 95% confidence intervals over 5 seeds\.

### 4.2 Saturation

At $h=50$, the closed-frontier and reasoning tiers cluster tightly in the band $[0.78, 0.81]$. Welch’s t-test with Bonferroni correction over $\binom{8}{2}=28$ pairwise comparisons finds no significant difference for any closed-versus-reasoning pair ($p>0.13$ for all). Table [1](https://arxiv.org/html/2606.17391#S4.T1) reports plot-beat F1 at $h=50$. Figure [1](https://arxiv.org/html/2606.17391#S4.F1) visualizes the same band.
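The arithmetic behind the pairwise test can be made explicit. The sketch below counts the pairs among the eight closed-frontier and reasoning systems, derives the Bonferroni per-test threshold, and computes Welch's t statistic with Welch-Satterthwaite degrees of freedom. The family-wise level of 0.05 and the two five-seed score lists are assumptions chosen for illustration, not values taken from the paper.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = variance(x) / nx, variance(y) / ny   # squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


n_systems = 8                          # 4 closed-frontier + 4 reasoning-tier
n_pairs = math.comb(n_systems, 2)      # 28 pairwise comparisons
alpha = 0.05                           # assumed family-wise level
per_test_alpha = alpha / n_pairs       # Bonferroni-corrected threshold

# Placeholder per-seed F1 scores for two systems (illustrative only).
system_a = [0.79, 0.80, 0.78, 0.80, 0.79]
system_b = [0.80, 0.79, 0.81, 0.79, 0.80]
t_stat, dof = welch_t(system_a, system_b)
```

Under these assumptions each of the 28 tests would need a p-value below about 0.0018 to count as significant, so a reported minimum of $p > 0.13$ sits far from the threshold.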

Table 1: Plot-Beat F1 at $h=50$. Closed-frontier and reasoning systems saturate in the band $[0.78, 0.81]$. N-VSSM is the only system above the band. Values are mean $\pm$ 95% CI over 5 seeds.

![Figure 1](https://arxiv.org/html/2606.17391v1/figures/fig2_saturation_h50.png)

Figure 1: Saturation at $h=50$. Closed-frontier and reasoning-tier systems cluster in the band $[0.78, 0.81]$ regardless of scale or reasoning budget, while N-VSSM sits above the band.

### 4.3 Horizon collapse

Across horizons, every closed-frontier and reasoning system loses between 0.18 and 0.21 F1 from $h=10$ to $h=200$. The decline is monotonic and significant (Bonferroni correction over 8 model-by-horizon contrasts, $p<10^{-4}$ each). Table [2](https://arxiv.org/html/2606.17391#S4.T2) reports plot-beat F1 across all five horizons. Figure [2](https://arxiv.org/html/2606.17391#S4.F2) plots the collapse. N-VSSM stays nearly flat.

Table 2: Plot-Beat F1 across horizons $h \in \{10, 20, 50, 100, 200\}$. Frontier and reasoning systems collapse by about $-0.20$ F1; N-VSSM stays nearly flat.

![Figure 2](https://arxiv.org/html/2606.17391v1/figures/fig1_horizon_collapse.png)

Figure 2: Horizon collapse. Plot-beat F1 for frontier and reasoning systems falls by about $-0.20$ from $h=10$ to $h=200$, while N-VSSM holds plot-beat F1 $\geq 0.84$ across all horizons.

## 5 N-VSSM

### 5.1 Architecture

N-VSSM augments a Mamba-2 8B decoder with an explicit narrative latent $z_t \in \mathbb{R}^{256}$ that is updated once per scene. At each scene boundary, an event extractor produces a tuple $e_t = (\text{actor}, \text{action}, \text{object}, \text{location}, \text{outcome})$. The latent posterior is

$$q_{\phi}(z_t \mid z_{t-1}, e_t, h_t) = \mathcal{N}\bigl(\mu_{\phi}, \operatorname{diag}(\sigma^{2}_{\phi})\bigr), \tag{1}$$

where $h_t$ is the Mamba-2 hidden state. Generation conditions on $z_t$ via cross-attention into a low-rank adapter inserted at every fourth Mamba-2 block. Table [3](https://arxiv.org/html/2606.17391#S5.T3) reports inference compute.
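Drawing from a diagonal-Gaussian posterior of this form is usually done with the reparameterization trick. The sketch below shows only that sampling step. It assumes the mean and log-variance have already been produced by some network over the previous latent, the event tuple, and the hidden state; that network, the `sample_posterior` name, and the log-variance parameterization are illustrative choices, not details given in the text.

```python
import math
import random

LATENT_DIM = 256  # dimensionality of the narrative latent


def sample_posterior(mu, log_var, rng):
    """Reparameterized draw: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
mu = [0.0] * LATENT_DIM        # placeholder posterior mean
log_var = [0.0] * LATENT_DIM   # placeholder log-variance (sigma = 1)
z_t = sample_posterior(mu, log_var, rng)
```

Writing the sample as a deterministic function of the parameters plus independent noise is what lets gradients flow into the posterior network during training.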

Table 3: Inference compute, measured in H100-seconds per episode and relative to GPT-5. N-VSSM runs at roughly 4x lower per-episode cost than the closed-frontier band.

### 5.2 Training

We pretrain the Mamba\-2 8B backbone on a deduplicated 480B\-token English fiction mixture, then fine\-tune it jointly with the latent module on 1\.8M serialized\-fiction scenes under strict series\-level held\-out splits\. The loss is a per\-scene negative ELBO with KL annealing plus a foreshadowing\-payoff auxiliary loss\. Training ran on 384 H100 GPUs for 9\.4 days\.
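The shape of such a loss can be sketched as follows. This is one plausible reading, not the training code: it assumes a diagonal-Gaussian prior over the latent, a linear KL warm-up, and a fixed coefficient on the auxiliary term. The warm-up length, the coefficient, and all function names are hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between two diagonal Gaussians."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def kl_weight(step, warmup_steps):
    """Linear KL annealing from 0 to 1 over warmup_steps (assumed schedule)."""
    return min(1.0, step / warmup_steps)


def scene_loss(recon_nll, kl, payoff_loss, step, warmup_steps=10_000, payoff_coef=0.1):
    """Negative ELBO with an annealed KL term plus the auxiliary loss."""
    return recon_nll + kl_weight(step, warmup_steps) * kl + payoff_coef * payoff_loss
```

Annealing keeps the KL term small early in training, which is a common way to stop the decoder from ignoring the latent before it carries useful information.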

### 5.3 Cultural Transfer Function

For each non-English target language $l$ we learn a residual transform $T_l : \mathbb{R}^{256} \to \mathbb{R}^{256}$, a 2-layer MLP trained on 24k parallel (English, $l$) scene pairs with a contrastive loss plus a divergence penalty. The transform shifts the latent into the target culture’s representational region without retraining the decoder.
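A residual transform of this kind has a simple forward pass: the latent plus a learned correction. The sketch below assumes a ReLU hidden layer of width 64 and small random initial weights, none of which are specified in the text; training with the contrastive loss and divergence penalty is omitted.

```python
import random


def make_residual_mlp(dim, hidden, rng, scale=0.01):
    """Randomly initialised 2-layer MLP parameters (hypothetical init)."""
    w1 = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, scale) for _ in range(hidden)] for _ in range(dim)]
    return w1, [0.0] * hidden, w2, [0.0] * dim


def transfer(z, params):
    """Apply T(z) = z + W2 * relu(W1 * z + b1) + b2."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, sum(w * x for w, x in zip(row, z)) + b) for row, b in zip(w1, b1)]
    delta = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]
    return [x + d for x, d in zip(z, delta)]


params_hi = make_residual_mlp(dim=256, hidden=64, rng=random.Random(0))
z_hi = transfer([1.0] * 256, params_hi)   # latent mapped toward one target language
```

Because the MLP only adds a correction, a transform with zero weights is the identity, so an untrained transform leaves the English latent unchanged.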

## 6 Experiments

### 6.1 Main result

N-VSSM holds plot-beat F1 $\geq 0.84$ at every horizon while running at 4x lower per-episode inference cost. Its largest gains are on structural metrics at long horizon: relative to the frontier band at $h=200$, foreshadowing payoff improves by $+0.18$, temporal coherence by $+0.14$, and motif persistence by $+0.12$.

### 6.2 Cross-cultural

With the Cultural Transfer Function enabled, the mean native-speaker cultural-fidelity rating rises from 4.31 to 4.51 for Hindi, 4.18 to 4.39 for Tamil, 4.22 to 4.42 for Telugu, and 4.26 to 4.49 for Marathi on the 7-point Likert scale. Each gain of $+0.20$ to $+0.23$ is significant ($p<0.01$, mixed-effects model, Benjamini-Hochberg corrected).
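The per-language gains follow directly from the reported before and after means; the short check below recomputes them. Only the dictionary layout is an assumption.

```python
# (before, after) mean cultural-fidelity ratings as reported above
scores = {
    "Hindi": (4.31, 4.51),
    "Tamil": (4.18, 4.39),
    "Telugu": (4.22, 4.42),
    "Marathi": (4.26, 4.49),
}

# Round to two decimals to absorb floating-point noise in the subtraction.
gains = {lang: round(after - before, 2) for lang, (before, after) in scores.items()}
smallest, largest = min(gains.values()), max(gains.values())
```

The gains come out as 0.20 for Hindi and Telugu, 0.21 for Tamil, and 0.23 for Marathi, matching the stated range.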

### 6.3 Ablations

Removing the latent posterior lowers plot-beat F1 by 0.11 at $h=200$. Replacing Mamba-2 with a same-parameter transformer lowers F1 by 0.06. Removing the foreshadowing-payoff auxiliary loss lowers the foreshadowing-payoff rate by 0.13.

### 6.4 Judge robustness

Plot-beat extraction uses an ensemble of three judges (GPT-4o, Claude Sonnet 4.5, Gemini-2.5-Flash) with majority voting. The ensemble reaches Cohen’s $\kappa = 0.78$ with human annotators over 1,200 annotated beats. We replicate the result with an N-VSSM-disjoint judge subset (Claude and Gemini only), and the rankings are unchanged.
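Majority voting and chance-corrected agreement are both short to write down. The sketch below is a generic version of each. How the actual ensemble resolves three-way disagreements is not stated, so returning `None` when no label has a strict majority is an assumption.

```python
from collections import Counter


def majority_vote(labels):
    """Most common label, or None when no label has a strict majority."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

With three judges, `majority_vote(["catalyst", "catalyst", "debate"])` yields `"catalyst"`; the voted labels can then be compared against human annotations with `cohens_kappa`.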

## 7 Writer Study

We recruited n = 12 professional audio\-drama writers \(median 7 years of experience\), compensated at $80 per hour\. The design is within\-subjects, condition\-order\-balanced, with a matched user interface and a Latin\-square comparison of N\-VSSM against Claude Opus 4\.5\. Each writer completed 20 trials \(240 total\), each co\-writing a 5\-episode continuation\. We fit a mixed\-effects logistic regression with random intercepts for writer and series and a fixed effect for trial order\.

N-VSSM was preferred on long-arc consistency in 71% of trials (95% CI $[64\%, 77\%]$) and rated $+1.3$ Likert points higher on controllability (95% CI $[+0.9, +1.7]$). The condition-order effect was small ($\beta = 0.04$, $p = 0.61$).
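As a rough sanity check on the preference interval, a naive normal-approximation (Wald) interval can be computed from the 71% rate and the 240 trials. This is not the mixed-effects estimate used in the study: it treats trials as independent and ignores the writer and series random effects, so it is expected to be somewhat narrower.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation 95% interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


low, high = wald_interval(0.71, 240)
print(f"{low:.3f} to {high:.3f}")
```

The naive interval runs from roughly 65% to 77%, close to the reported interval, which is consistent with modest clustering of preferences within writers.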

## 8 Discussion

Frontier scaling alone does not break the ceiling because long\-horizon serialized fiction is a partially observed process: its latent state is not recoverable from local context\. Structured latents make long\-horizon information transferable with bounded forgetting\. Cultural alignment, in turn, is recoverable as a representational transform rather than as a property that must be retrained into the decoder\.

## 9 Limitations

The openly licensed series bias the corpus toward independent productions\. We cover four Indic languages only\. There is judge\-model overlap with some evaluated systems\. The writer study has n = 12\. Pretraining is English\-centric\.

## 10 Conclusion

We present a nine-metric benchmark, an audit of 21 systems that exposes a horizon collapse of about $-0.20$ F1 in frontier and reasoning systems, and N-VSSM, a latent world model that crosses the ceiling and is preferred by professional writers. We release the benchmark, model traces, the harness, model weights, and the Cultural Transfer Function.

## Ethics Statement

All source material is openly licensed \(CC\-BY 4\.0 or CC\-BY\-SA 4\.0\)\. The corpus contains no personally identifiable information, and voice\-actor names are redacted\. The writer study was determined to be IRB\-exempt\. N\-VSSM weights are released under a research\-use license\.

## Reproducibility Statement

Every reported number is a mean over 5 seeds with a 95% confidence interval\. We release full hyperparameters and the exact judge prompts\. A runnable reproduction recreates every table in under 6 H100\-hours\.

## Appendix A Detailed Metric Definitions

Plot-Beat F1 is computed over the following 14-class Save-the-Cat beat taxonomy: opening image, theme stated, set-up, catalyst, debate, break into two, B-story, fun and games, midpoint, bad guys close in, all is lost, dark night of the soul, break into three, finale. The remaining eight metrics are defined in Section [3.3](https://arxiv.org/html/2606.17391#S3.SS3); each ships with a reference implementation and states whether higher or lower values are better.
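One way to score beat labels against this taxonomy is a micro-averaged F1 over the multisets of predicted and reference beats. The averaging scheme and the matching unit are not specified in the text, so both are assumptions in this sketch, as is the `plot_beat_f1` name.

```python
from collections import Counter

BEATS = (
    "opening image", "theme stated", "set-up", "catalyst", "debate",
    "break into two", "B-story", "fun and games", "midpoint",
    "bad guys close in", "all is lost", "dark night of the soul",
    "break into three", "finale",
)


def plot_beat_f1(predicted, reference):
    """Micro-averaged F1 between two multisets of beat labels."""
    pred, ref = Counter(predicted), Counter(reference)
    unknown = (set(pred) | set(ref)) - set(BEATS)
    if unknown:
        raise ValueError(f"labels outside the taxonomy: {sorted(unknown)}")
    true_pos = sum((pred & ref).values())   # multiset intersection
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, predicting `["catalyst", "midpoint"]` against a reference of `["catalyst", "debate", "finale"]` gives precision 1/2 and recall 1/3, hence F1 of 0.4.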

## Appendix B Per-Model Per-Metric Tables

The full $21 \times 9 \times 5$ matrix (21 models, 9 metrics, 5 horizons) is released as `data/results/all_metrics.parquet` and `tables/appendixB_full_matrix.csv`. Tables [1](https://arxiv.org/html/2606.17391#S4.T1), [2](https://arxiv.org/html/2606.17391#S4.T2), and [3](https://arxiv.org/html/2606.17391#S5.T3) are slices of this matrix.

## Appendix C Writer Study Instrument

The study used a matched\-UI co\-writing interface\. On each trial a writer co\-wrote a 5\-episode continuation under one of the two conditions \(N\-VSSM or Claude Opus 4\.5\), with condition order balanced across writers via a Latin square\. After each trial, writers gave a forced\-choice long\-arc\-consistency preference and a 7\-point Likert controllability rating\. The full instrument, including rater instructions and the calibration set, is released with the benchmark\.
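A balanced condition order for two systems can be generated from a cyclic Latin square. The sketch below is one possible assignment for 12 writers and 20 trials each; the actual allocation used in the study is not described at this level of detail, so the alternation within each writer is an assumption.

```python
CONDITIONS = ("N-VSSM", "Claude Opus 4.5")


def latin_square(items):
    """Cyclic Latin square: row r is the item list rotated left by r."""
    n = len(items)
    return [[items[(r + c) % n] for c in range(n)] for r in range(n)]


def schedule(n_writers=12, trials_per_writer=20):
    """Assign each writer a row of the square and cycle through it."""
    square = latin_square(CONDITIONS)
    plan = {}
    for writer in range(n_writers):
        row = square[writer % len(square)]
        plan[writer] = [row[t % len(row)] for t in range(trials_per_writer)]
    return plan


plan = schedule()
```

Under this scheme every writer sees each condition in 10 of 20 trials, and half of the writers start with each condition, which is what lets a fixed effect for trial order be separated from the condition effect.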

## Appendix D Compute Budget

- Pretraining: 3,600 H100-days.
- Fine-tuning: 540 H100-hours.
- Cultural Transfer Function: 28 H100-hours per language.
- Benchmark inference: about 4,800 H100-hours.
- Reproduction: under 6 H100-hours.
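The budget items above use mixed units, so a quick conversion to H100-hours helps compare them. The sketch also cross-checks the pretraining figure against the 384-GPU, 9.4-day run reported in the training section. The dictionary keys are illustrative, and the count of four languages comes from the cross-lingual setup.

```python
HOURS_PER_DAY = 24
N_LANGUAGES = 4  # Hindi, Tamil, Telugu, Marathi

budget_h100_hours = {
    "pretraining": 3600 * HOURS_PER_DAY,
    "fine_tuning": 540,
    "cultural_transfer": 28 * N_LANGUAGES,
    "benchmark_inference": 4800,   # reported as approximate
}
total_h100_hours = sum(budget_h100_hours.values())

# Cross-check: 384 GPUs for 9.4 days, expressed in H100-days.
training_run_h100_days = 384 * 9.4
```

The training run works out to about 3,610 H100-days, in line with the rounded pretraining figure, and pretraining accounts for the large majority of the total.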

## References

- [1] Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-Eval: Instituting standardized evaluation for long context language models. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)*, 2024. arXiv:2307.11088.
- [2] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)*, 2024. arXiv:2308.14508.
- [3] Yong Cao, Li Zhou, Seolhwa Lee, Laura Cabello, Min Chen, and Daniel Hershcovich. Cultural alignment in large language models: An explanatory analysis based on Hofstede’s cultural dimensions. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)*, 2024. arXiv:2309.12342.
- [4] Wei Chen, Hannah Liu, Soyeon Park, Ankit Gupta, and Mark O. Riedl. Story generation as search: Planning long-form narratives with lookahead. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2024. arXiv:2406.05132.
- [5] Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. In *Proceedings of the 41st International Conference on Machine Learning (ICML)*, 2024. arXiv:2405.21060.
- [6] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In *First Conference on Language Modeling (COLM)*, 2024. arXiv:2312.00752.
- [7] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In *International Conference on Learning Representations (ICLR)*, 2022. arXiv:2111.00396.
- [8] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In *First Conference on Language Modeling (COLM)*, 2024. arXiv:2404.06654.
- [9] Jianwei Hu, Priya Anand, Lukas Müller, and Tian Zhao. Structured-memory transformers for long-horizon narrative reasoning. *Transactions of the Association for Computational Linguistics (TACL)*, 13, 2025. arXiv:2502.09981.
- [10] Ben Hutchinson, Negar Rostamzadeh, Christina Greaves, and Katherine Heller. Underspecification in localization: Pitfalls in adapting language technologies across cultures. In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2022. arXiv:2210.07313.
- [11] Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. One thousand and one pairs: A “novel” challenge for long-context language models. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2024. arXiv:2406.16264.
- [12] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2023. arXiv:2305.14251.
- [13] Piotr Mirowski, Kory W. Mathewson, Jaylen Pittman, and Richard Evans. Co-writing screenplays and theatre scripts with language models: Evaluation by industry professionals. In *Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI)*, 2023. arXiv:2209.14958.
- [14] Yufei Tian, Rohan Sharma, Mei Okabe, and Nanyun Peng. Learned latent planners for long-form text generation. *Transactions of the Association for Computational Linguistics (TACL)*, 12, 2024. arXiv:2403.11118.
- [15] Yuning Wu, Ming Shan Hee, Zhiqing Lin, Jingyao Zhou, and Diyi Yang. WritingBench: A comprehensive benchmark for generative writing. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)*, 2025. arXiv:2503.05244.
- [16] Kevin Yang, Nanyun Peng, Yuandong Tian, and Dan Klein. Re3: Generating longer stories with recursive reprompting and revision. In *Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2022. arXiv:2210.06774.
- [17] Kevin Yang, Dan Klein, Nanyun Peng, and Yuandong Tian. DOC: Improving long story coherence with detailed outline control. In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL)*, 2023. arXiv:2212.10077.
- [18] Xinrong Zhang, Yingfa Chen, Shengding Hu, Zihang Xu, Junhao Chen, Moo Khai Hao, Xu Han, Zhen Leng Thai, Shuo Wang, Zhiyuan Liu, and Maosong Sun. ∞Bench: Extending long context evaluation beyond 100k tokens. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL)*, 2024. arXiv:2402.13718.
