NARRA-Gym for Evaluating Interactive Narrative Agents

arXiv cs.CL 05/12/26, 04:00 AM Papers
llm-evaluation interactive-narrative benchmark agentic-ai story-generation academic-research
Summary
This paper introduces NARRA-Gym, a benchmark and executable evaluation environment for assessing Large Language Models' abilities in sustaining interactive narratives, managing memory, and adapting to users over multiple turns.
arXiv:2605.08503v1 Announce Type: new
Cached at: 05/12/26, 06:51 AM
# NARRA-Gym for Evaluating Interactive Narrative Agents
Source: [https://arxiv.org/html/2605.08503](https://arxiv.org/html/2605.08503)
Yuchen Ma†,‡ \(LMU Munich; Munich Center for Machine Learning\), Jiayi Ye† \(Independent Researcher\), Wenjie Wang \(University of Notre Dame\), Zipeng Ling \(University of Pennsylvania\), Xingjian Hu \(Lehigh University\), Yuexing Hao \(Massachusetts Institute of Technology\), Zichen Chen \(Bake AI; UC Santa Barbara; Stanford University\), Zhangchen Xu \(University of Washington\), Yunhong He \(University of Notre Dame\), Zhengqing Yuan \(University of Notre Dame\), Yujun Zhou \(University of Notre Dame\), Kehan Guo \(University of Notre Dame\), Chaoran Chen \(University of Notre Dame\), Toby Jia\-Jun Li \(University of Notre Dame\), Stefan Feuerriegel \(LMU Munich; Munich Center for Machine Learning\), Xiangliang Zhang\* \(University of Notre Dame\)

NARRA\-Gym Technical Report


Preprint\. Under review\.

†These authors contributed equally to this work\. \*Corresponding author: xzhang33@nd\.edu\. ‡Yuchen Ma is supported by the DAAD program “Konrad Zuse Schools of Excellence in Artificial Intelligence,” sponsored by the Federal Ministry of Education and Research\.

Abstract: Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns\. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post\-hoc ratings, and therefore miss whether models can jointly manage story generation, long\-context state and pacing, character simulation, empathic personalization, and story\-grounded artifacts\. We introduce NARRA\-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model\-in\-the\-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis\. We evaluate nine frontier LLMs using a controlled LLM\-as\-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs\. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance\-sensitive personalization\. These findings suggest that interactive narrative offers a useful benchmark for evaluating long\-horizon, user\-adaptive LLM behavior beyond isolated story quality\.

## 1 Introduction

Interactive narrative refers to settings in which an LLM must sustain a coherent, evolving story world while interacting with a user over multiple turns, adapting both the narrative and its behavior in response to user input \(Urbanek et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib10); Akoury et al\., [2020](https://arxiv.org/html/2605.08503#bib.bib40); Du and Chilton, [2023](https://arxiv.org/html/2605.08503#bib.bib41); Park et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib3)\)\. Such capabilities are increasingly used in creative domains, including collaborative storytelling, games, and interactive media, where models act as live narrative agents\. \(Footnote 1: For example, game companies are already experimenting with generative models for live characters and narrative production\. Examples are Ubisoft’s NEO non\-player character \(NPC\) prototype, Xbox–Inworld narrative tools, and NVIDIA ACE for natural\-language NPC interaction \(Ubisoft News, [2024](https://arxiv.org/html/2605.08503#bib.bib56); Microsoft Game Dev, [2023](https://arxiv.org/html/2605.08503#bib.bib57); NVIDIA, [2023](https://arxiv.org/html/2605.08503#bib.bib58)\)\.\) At the same time, interactive narrative provides a challenging testbed for language models more broadly, because it requires a combination of multiple capabilities \(i\.e\., generation, memory, planning, and personalization\) under continuous, multi\-turn interaction\.

This task is challenging because it extends far beyond conventional story writing\. Unlike static text generation \(Fan et al\., [2018](https://arxiv.org/html/2605.08503#bib.bib4); Yao et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib5); Guan et al\., [2022](https://arxiv.org/html/2605.08503#bib.bib36)\), it requires the LLM to manage story progression, character consistency, a long\-context state, and user alignment over multiple turns; simulate consistent characters; and adapt to the user’s emotional trajectory\. For example, the LLM must introduce new events while preserving prior context, keep characters psychologically coherent, respond appropriately to shifting user signals, and, in some cases, externalize the story through interactive artifacts such as letters, maps, or small interfaces\. Together, these demands make interactive narrative a difficult benchmarking setting, where failures in memory, planning, or alignment can break the interaction even when individual responses appear fluent\.

Existing benchmarks for evaluating interactive narrative have key shortcomings\. Standard LLM benchmarks emphasize static, single\-turn tasks such as question answering and closed\-form reasoning \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.08503#bib.bib1); BIG\-bench Collaboration, [2022](https://arxiv.org/html/2605.08503#bib.bib2)\)\. Story\-centric resources and surveys have broadened the evaluation to include narrative generation and narrative understanding \(Guan et al\., [2022](https://arxiv.org/html/2605.08503#bib.bib36); Yang and Jin, [2024](https://arxiv.org/html/2605.08503#bib.bib12); Wang et al\., [2023c](https://arxiv.org/html/2605.08503#bib.bib13); Zhu et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib15)\), but many protocols still evaluate isolated generations, offline corpora, or post\-hoc ratings\. Hence, these benchmarks typically capture whether an LLM writes a plausible passage, but miss whether the LLM can manage a long interactive narrative by keeping the story, cast, user history, and emotional contract coherent across turns\.

Interactive narrative therefore presents a challenging dynamic testbed for LLMs: it requires capabilities that are often evaluated separately to operate together\. Planning must survive improvisation, memory must support character consistency, empathy must shape story direction, and generated artifacts must remain grounded in the evolving fiction\. Hence, a failure in any one of these abilities can break the entire interaction, even if each individual response sounds fluent\. Here, we introduce NARRA\-Gym, an executable evaluation environment for benchmarking interactive narrative agents over multi\-turn interaction\.

NARRA\-Gym is motivated by five coupled capabilities that only fully surface under sustained interaction:

- ❶ Creative story generation\. The model must construct a complete narrative arc from a sparse emotional seed, requiring both compelling prose and high\-level story steering \(Fan et al\., [2018](https://arxiv.org/html/2605.08503#bib.bib4); Yao et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib5); Wang et al\., [2024b](https://arxiv.org/html/2605.08503#bib.bib20); Bae and Kim, [2024](https://arxiv.org/html/2605.08503#bib.bib21); Gómez\-Rodríguez and Williams, [2023](https://arxiv.org/html/2605.08503#bib.bib37)\)\.
- ❷ Long\-context state and pacing management\. The model must keep dialogue history, unresolved tensions, revealed clues, user decisions, and the current narrative tempo available as actionable context without contradiction, drift, or stagnation \(Liu et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib6); Bai et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib7); Lyu et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib45); Wu et al\., [2025](https://arxiv.org/html/2605.08503#bib.bib46)\)\.
- ❸ Character simulation\. Characters must remain distinguishable in voice and motivation while evolving with the plot \(Park et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib3); Wang et al\., [2024a](https://arxiv.org/html/2605.08503#bib.bib8); Han et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib23); Chen et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib24); Papoudakis et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib38)\)\.
- ❹ Empathic personalization\. The story must align with the user’s emotional needs without collapsing into generic therapeutic phrasing \(Rashkin et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib9); Harel\-Canada et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib27); Yunusov et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib44)\)\.
- ❺ Interactive artifact generation\. The model must produce functional, story\-grounded HTML, CSS, and JavaScript artifacts that remain novel and integrated with the evolving narrative \(Urbanek et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib10); Akoury et al\., [2020](https://arxiv.org/html/2605.08503#bib.bib40); Yang et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib26)\)\.

Following the environment\-based evaluation framing of OpenAI Gym \(Brockman et al\., [2016](https://arxiv.org/html/2605.08503#bib.bib51)\), NARRA\-Gym places each tested model inside the same repeatable episode scaffold, where each generated response updates the next state\. The scaffold is not the object being ranked; it is rather the controlled interaction setting that makes model differences observable by keeping the session runnable, logged, and comparable\. An episode begins from a sparse *emotional seed*, constructs a structured story world, and then runs a multi\-turn interaction loop with logged memory updates, pacing checks, planning traces, and optional story\-grounded artifacts\. This design turns interactive storytelling from a loosely specified demo setting into a reproducible evaluation protocol\.

Our contributions are:

- ❶ An executable benchmark environment\. We define an interactive evaluation setting that jointly tests creative story generation, long\-context state and pacing management, character simulation, empathic personalization, and story\-grounded artifact generation inside a single interaction loop\.
- ❷ A modular narrative\-agent pipeline\. We implement a staged story construction pipeline, including multi\-resolution memory, reflection\-guided planning, anti\-stagnation control, novelty\-constrained artifact synthesis, and fail\-soft structured generation, where each component is logged for inspection\.
- ❸ A comparative evaluation protocol\. We provide a human rating protocol with within\-group rank aggregation, together with LLM\-judge protocols for comparing generator models across personas, rubric dimensions, and judge calibrations, exposing failures that are difficult to observe in static narrative datasets\.

## 2 Evaluation Environment Construction

NARRA\-Gym orchestrates the model through a complete episode pipeline, summarized in Figure [1](https://arxiv.org/html/2605.08503#S2.F1)\. Here, an *episode* means one full interactive story session from initial user input to the final logged conversation\. The episode begins with the user’s *emotional seed*: a free\-text description of their current situation or mood, entered through the start interface shown in Figure [A5](https://arxiv.org/html/2605.08503#A15.F5) \(Appendix [O](https://arxiv.org/html/2605.08503#A15)\)\. This seed can be enriched by *profiling answers* \(short questionnaire responses about preferences and comfort boundaries\) and *selected keywords* \(user\-chosen descriptors that should influence the story\), as illustrated in Figure [A6](https://arxiv.org/html/2605.08503#A15.F6)\. The Narrative Architect then converts this sparse input into a runnable story world through five logged construction stages: \(1\) story foundation, \(2\) setting construction, \(3\) character construction, \(4\) act structure, and \(5\) opening scene generation\. The output of this initialization phase is not just prose, but a *structured episode state*: machine\-readable fields for the premise, setting, cast, act outline, opening dialogue, hidden elements, and initial choices\. Figure [A7](https://arxiv.org/html/2605.08503#A15.F7) shows a representative generated synopsis and cast view from this construction phase\.

After initialization, the episode enters a *turn\-level interaction loop*, meaning the repeated cycle that runs after each user action\. The user can either select a displayed choice or type a *free\-form message*, an open text input that is not limited to the displayed choices\. For each turn, a Memory Agent assembles recent dialogue, profile information, *story memory* \(persistent fields such as current goal, clues, and tensions\), and *user\-journey state* \(the user’s recorded decisions and emotional trajectory\); the LLM generates the next story beat; a Pacing Agent and the structure guard check whether the plot actually advanced; a Planning Agent optionally produces planning guidance for subsequent turns; and an Artifact Agent can generate story\-grounded interactive artifacts when the narrative calls for a tangible prop\. The response, choices, memory updates, pacing interventions, artifact metadata, and LLM traces are written back into the episode state before the next user action\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/x1.png)

Figure 1: Pipeline view of a NARRA\-Gym episode\. \(A\) The Narrative Architect turns the user’s emotional seed into the basic elements of a structured story world through five construction stages: story foundation, setting construction, character construction, act structure, and opening scene generation\. \(B\) The initialized story then enters a continuous interaction loop coordinated by the remaining four agents: user choice or free\-text input, context assembly by the Memory Agent, LLM story advancement, pacing and structure checks by the Pacing Agent, optional planning by the Planning Agent, optional artifact generation by the Artifact Agent, and state/log updates before the next turn\.

### 2\.1 From User Input to Story World

Before any interaction begins, the system must turn sparse emotional input into a fully realized narrative world\. The Narrative Architect decomposes this into five logged stages so that, for example, a model that produces good premises but flat characters can be distinguished from one that fails at act\-level planning\.

An evaluation episode starts when a user provides an emotional context \(a free\-text description of their current situation or mood; Figure [A5](https://arxiv.org/html/2605.08503#A15.F5)\) and answers a short profiling questionnaire that captures preferences and comfort boundaries \(Figure [A6](https://arxiv.org/html/2605.08503#A15.F6)\)\. The Narrative Architect then constructs the story through five sequential stages, each producing a structured, replayable artifact\. Figure [2](https://arxiv.org/html/2605.08503#S2.F2) shows an example of this loop mid\-episode: panel ❷ shows an example dialogue exchange driving the turn, while panels ❸–❻ show the latent state surfaces that the agents read and update on every cycle\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/x2.png)

Figure 2: An example evaluation episode at runtime\. The interface exposes all observable signals available to an interactive narrative agent during Act 1 of a session\. ❶ *Story header*: the Stage\-1 title and Stage\-2 atmosphere generated by the Narrative Architect\. ❷ *Dialogue loop*: italic narration, non\-player character \(NPC\) utterances, the user’s free\-text response, and capped branching choices, with both `/messages` and `/choices` serving as valid interaction channels\. ❸ *Scene*: the current story state tracked by the Memory Agent, including `location`, act index, `current_goal`, `open_tensions`, and the cinematic observation frame\. ❹ *Cast*: Stage\-3 character profiles with role, condensed traits, protagonist relationship, and on/off\-screen status\. ❺ *Journey*: the Stage\-4 act blueprint and visited\-location trajectory monitored by the Pacing Agent\. ❻ *Emotion*: the evolving `UserJourney` arc summarized after a reflection pass by the Planning Agent\. Together, panels ❸–❻ externalize the latent narrative state, making each session both an immersive interactive story and a reproducible evaluation trace\.

The construction pipeline contains five logged stages\. ❶ Story foundation creates the title, premise, theme, emotional undercurrent, and protagonist objective, separating failures of ideation from failures of scene realization\. ❷ Setting construction translates the emotional context into a concrete world and scene frame that later supports location continuity\. ❸ Character construction builds the cast, including protagonist and supporting roles, backstory, personality, and speech style\. ❹ Act structure drafts a multi\-act outline and optionally refines it through a critic\-then\-refiner loop; if either call fails, the original outline is retained and the episode continues\. ❺ Opening scene and initial interaction generates the opening prose, first dialogue, branching choices, hidden story elements, and active tensions, so the interaction loop starts from structured state rather than free\-form text\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/example_watercolor.png)

Figure 3: Running example of the Narrative Architect, applied to a user who feels “stuck in a routine\.” Each stage produces a logged artifact that can be inspected independently\.

A representative construction output \(story synopsis and character profiles\) is shown in Figure [A7](https://arxiv.org/html/2605.08503#A15.F7) \(Appendix [O](https://arxiv.org/html/2605.08503#A15)\)\. Once Stage 5 completes, the user enters the interaction loop\. Stage\-level prompt templates and expected output formats are detailed in Appendix [E](https://arxiv.org/html/2605.08503#A5)\.

### 2\.2 Turn\-by\-Turn Story Interaction

Once the story world is built, every user message triggers a multi\-step cycle: assemble context, generate the next story beat, check whether the plot actually advanced, and plan ahead\. Four agent components \(i\.e\., Memory Agent, Pacing Agent, Planning Agent, and Artifact Agent\) coordinate this turn\-level loop\.

At each turn, the agent assembles context from multiple memory layers \(Memory Agent\)\. Rather than simply concatenating every past message into the prompt, the Memory Agent maintains three explicit state layers: a *user profile* \(emotional needs, preferred tone, comfort boundaries\), a *story state* \(what just happened, current goal, open tensions, active clues, last turning point\), and a *user journey* \(which choices the user made and how their engagement evolved\)\. These layers are updated at *different frequencies*: raw message history is kept verbatim for recent grounding; lightweight structured fields are refreshed after each non\-system turn; and rolling dialogue summaries are generated every three turns for medium\-range compression\. As a result, each prompt receives not just a flat transcript but also curated variables that expose unresolved conflicts, remembered clues, and current objectives\. This makes the benchmark harder than simple context stuffing, because the model must stay consistent with both the surface dialogue and the structured story state\. The full memory schema is given in Appendix [F](https://arxiv.org/html/2605.08503#A6)\.
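The layered update schedule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the NARRA-Gym implementation: the class `EpisodeMemory`, its field names, the `summarize` callable, and the `recent` window size are hypothetical. Only the three state layers and the every-three-turns summary cadence come from the text.

```python
from dataclasses import dataclass, field

SUMMARY_EVERY = 3  # from the text: rolling summaries every three turns


@dataclass
class EpisodeMemory:
    """Hypothetical container for the three state layers plus raw history."""
    user_profile: dict = field(default_factory=dict)  # needs, tone, boundaries
    story_state: dict = field(default_factory=dict)   # goal, tensions, clues
    user_journey: list = field(default_factory=list)  # choices made so far
    raw_history: list = field(default_factory=list)   # verbatim (user, reply)
    summaries: list = field(default_factory=list)     # medium-range compression

    def record_turn(self, user_msg, reply, state_update, summarize):
        # Raw history and structured fields are refreshed on every turn.
        self.raw_history.append((user_msg, reply))
        self.story_state.update(state_update)
        self.user_journey.append(user_msg)
        # Summaries are refreshed only every SUMMARY_EVERY turns.
        if len(self.raw_history) % SUMMARY_EVERY == 0:
            self.summaries.append(summarize(self.raw_history[-SUMMARY_EVERY:]))

    def build_context(self, recent=4):
        # The prompt receives curated variables, not just a flat transcript.
        return {
            "profile": self.user_profile,
            "story_state": self.story_state,
            "journey": self.user_journey,
            "summaries": self.summaries,
            "recent_messages": self.raw_history[-recent:],
        }


memory = EpisodeMemory(user_profile={"tone": "gentle"})
for turn in range(7):
    memory.record_turn(
        f"choice {turn}",
        f"beat {turn}",
        {"current_goal": f"goal {turn}"},
        # Stand-in summarizer; a real system would call an LLM here.
        summarize=lambda window: " / ".join(reply for _, reply in window),
    )
context = memory.build_context()
```

After seven turns the sketch holds two summaries (written at turns three and six), while the structured story state reflects the latest turn.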

After generation, the agent checks whether the story actually moved forward \(Pacing Agent\)\. A common failure mode in open\-ended interaction is that the story *sounds good* turn\-by\-turn, yet never actually progresses\. The Pacing Agent counters this with three layers of defense\. *First*, the prompt is augmented with runtime pacing controls \(tracked previous choices, visited locations, and turn\-count\-based directives\) that escalate from gentle encouragement to mandatory structural shifts \(e\.g\., a new location, a revelation, a character arrival\) as a scene persists\. *Second*, a stagnation detector checks whether real plot change has occurred: a pattern\-based check scans the last several messages for repetitive choice patterns and recycled NPC phrases, while a token\-overlap comparison between consecutive dialogue summaries catches near\-identical loops\. If either signal fires, the next turn is flagged for forced advancement\. *Third*, a post\-generation structure guard inspects the model’s output and, if the required narrative shift is missing, patches the scene state to inject a reveal, a goal change, escalating stakes, or fallback branching paths\. Across five escalation levels \(from normal pacing at $<5$ exchanges to forced resolution at $\geq 14$\), these layers prevent the story from stalling in an endless second act\. Pacing thresholds and the intervention decision table are provided in Appendix [G](https://arxiv.org/html/2605.08503#A7)\.
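The escalation ladder and the summary-overlap check can be illustrated with a short sketch. Several details here are assumptions: the text only fixes the end points of the ladder (normal pacing below 5 exchanges, forced resolution from 14), so the intermediate boundaries in `LEVEL_BOUNDS` are placeholders, and the overlap measure (Jaccard over whitespace tokens) and its 0.8 threshold are hypothetical choices. The actual values are in the paper's Appendix G.

```python
from bisect import bisect_right

# Boundaries between the five escalation levels. Only 5 and 14 are taken
# from the text; 8 and 11 are placeholder assumptions for illustration.
LEVEL_BOUNDS = [5, 8, 11, 14]


def escalation_level(exchanges):
    """Map an exchange count to a level from 0 (normal) to 4 (forced resolution)."""
    return bisect_right(LEVEL_BOUNDS, exchanges)


def token_overlap(summary_a, summary_b):
    """Jaccard overlap between the token sets of two summaries (assumed measure)."""
    tokens_a = set(summary_a.lower().split())
    tokens_b = set(summary_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_stagnant(prev_summary, curr_summary, threshold=0.8):
    """Flag a near-identical loop between consecutive dialogue summaries."""
    return token_overlap(prev_summary, curr_summary) >= threshold


looping = is_stagnant(
    "Mara asks the keeper about the locked door again",
    "Mara asks the keeper about the locked door again",
)
advancing = is_stagnant(
    "Mara asks the keeper about the locked door",
    "The lighthouse floods and a stranger arrives by boat",
)
```

In this sketch a stagnation flag would be combined with the escalation level to decide how strongly the next prompt pushes for a structural shift.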

Before the next turn, the agent reflects on where the story should go \(Planning Agent\)\. The Planning Agent analyzes the current story state and returns structured guidance: unresolved tensions, inferred user interests, advancement strategy, pacing assessment, and optional artifact recommendations\. The module uses a fixed output schema rather than free\-form prose, so that planning output is machine\-readable and can be logged separately from the narrative response\. If the reflection call fails or returns badly formatted data, the agent falls back to a safe default so that one bad generation does not derail an entire session\. This design means that NARRA\-Gym evaluates both *acting* \(how well the model writes the next story beat\) and *steering* \(how well it anticipates where the story should go\)\.
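A fixed schema with a safe default might look like the sketch below. The JSON encoding, the key names in `REQUIRED_KEYS`, and the contents of `SAFE_DEFAULT` are assumptions made for illustration; the text names the five guidance fields but does not specify their serialized form.

```python
import json

# Hypothetical key names mirroring the guidance fields listed in the text.
REQUIRED_KEYS = (
    "unresolved_tensions",
    "inferred_user_interests",
    "advancement_strategy",
    "pacing_assessment",
)

# Conservative placeholder used when reflection fails (assumed content).
SAFE_DEFAULT = {
    "unresolved_tensions": [],
    "inferred_user_interests": [],
    "advancement_strategy": "continue current scene",
    "pacing_assessment": "unknown",
    "artifact_recommendation": None,
}


def parse_reflection(raw):
    """Return validated planning guidance, or the safe default on any failure."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(SAFE_DEFAULT)
    if not isinstance(data, dict) or any(key not in data for key in REQUIRED_KEYS):
        return dict(SAFE_DEFAULT)
    # The artifact recommendation is optional in the text, so default it.
    data.setdefault("artifact_recommendation", None)
    return data


good = parse_reflection(
    '{"unresolved_tensions": ["the sealed letter"],'
    ' "inferred_user_interests": ["quiet scenes"],'
    ' "advancement_strategy": "reveal the sender",'
    ' "pacing_assessment": "slow"}'
)
bad = parse_reflection("Sure! Here is my plan: ...")
```

Because the fallback has the same shape as a valid reflection, downstream code can consume either without branching.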

### 2\.3 In\-World Artifacts and Sustained Novelty

Beyond text, the agent can produce tangible in\-world props\. The challenge is not generating one good artifact but sustaining variety across an entire session\.

When the Planning Agent recommends an interactive element, the Artifact Agent generates it as a self\-contained HTML, CSS, and JavaScript artifact\. Examples include letters the user can unfold, maps they can explore, and ciphers they can solve \(Figure [A1](https://arxiv.org/html/2605.08503#A8.F1), Appendix [H](https://arxiv.org/html/2605.08503#A8)\)\. This lets NARRA\-Gym test whether a model can externalize story state into a manipulable object\. However, over a full session, models quickly fall into repetitive patterns \(e\.g\., reusing the same visual metaphor or interaction style\)\.

Each artifact is checked against recent history to prevent repetition\. To prevent this, each artifact is tagged along four dimensions: base type \(e\.g\., letter, map, puzzle\), visual style \(e\.g\., paper prop, analog device\), semantic content \(e\.g\., document, memory\), and interaction pattern \(e\.g\., click\-to\-reveal, drag\-and\-arrange, timed\)\. The tag set is compared against the last six accepted artifacts using a Jaccard\-based similarity score with category bonuses\. If the score exceeds a similarity threshold of $\tau \approx 0.6$, the agent retries once with an explicit anti\-repetition instruction naming the closest prior artifact\. If the retry does not lower the score, the original is kept, and the violation is logged\. Every accepted artifact, along with its tags and similarity score, is saved into story memory for both future generation and later analysis\. The complete tag taxonomy is given in Appendix [H](https://arxiv.org/html/2605.08503#A8)\.
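The novelty check can be sketched as below, under several simplifying assumptions. Plain Jaccard similarity is used and the category bonuses mentioned in the text are omitted because their form is not given here; the `generate` callable stands in for the Artifact Agent and returns only a tag set; and a retry that lowers the score is accepted even if it remains above the threshold, which the text does not specify.

```python
from collections import deque

TAU = 0.6   # similarity threshold from the text (stated as approximate)
WINDOW = 6  # compare against the last six accepted artifacts


def jaccard(tags_a, tags_b):
    tags_a, tags_b = set(tags_a), set(tags_b)
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


def closest_prior(tags, history):
    """Return (highest similarity, most similar prior tag set)."""
    best_score, best = 0.0, None
    for prior in history:
        score = jaccard(tags, prior)
        if score > best_score:
            best_score, best = score, prior
    return best_score, best


def accept_artifact(generate, history, violations):
    """Generate an artifact, retrying once if it is too similar to recent ones."""
    tags = generate(None)
    score, closest = closest_prior(tags, history)
    if score > TAU:
        # Single retry that names the closest prior artifact to avoid.
        retry_tags = generate(closest)
        retry_score, _ = closest_prior(retry_tags, history)
        if retry_score < score:
            tags, score = retry_tags, retry_score
        else:
            # Retry did not help: keep the original and log the violation.
            violations.append((tags, score))
    history.append(tags)
    return tags, score


history = deque(maxlen=WINDOW)
violations = []
letter = frozenset({"letter", "paper prop", "document", "click-to-reveal"})
atlas = frozenset({"map", "analog device", "memory", "drag-and-arrange"})

accept_artifact(lambda avoid: letter, history, violations)
tags, score = accept_artifact(
    lambda avoid: letter if avoid is None else atlas, history, violations
)
```

The bounded `deque` keeps only the six most recent accepted artifacts, so older tag sets stop influencing the check automatically.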

### 2\.4 System Reliability and Efficient Execution

A benchmark episode can span dozens of turns and multiple LLM calls per turn\. The system must recover gracefully from formatting failures and keep latency low enough for natural\-feeling interaction\.

The system distinguishes required initialization failures from recoverable turn\-level errors\. The guiding principle is: if a story cannot be properly initialized, the episode aborts; if a turn\-level component produces malformed output, the system substitutes a safe default and continues\. Required construction stages \(Stages 1–3, 5\) fail fast on parse errors\. All other components are *fail\-soft*: a failed critic preserves the original outline, a failed reflection returns a conservative placeholder, a malformed turn falls back to safe narrative text, and a missing narrative shift triggers the structure guard\. This ensures that evaluation traces reflect genuine narrative quality rather than incidental formatting noise\. The full failure\-handling table is given in Appendix [I](https://arxiv.org/html/2605.08503#A9)\.
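The fail-fast versus fail-soft split can be expressed as a small policy wrapper. This is a sketch only: the exception name `EpisodeAborted`, the `run_stage` helper, and the use of JSON as the structured output format are assumptions. The set of required stages is taken from the text.

```python
import json


class EpisodeAborted(Exception):
    """Raised when a required construction stage cannot be parsed."""


REQUIRED_STAGES = {1, 2, 3, 5}  # from the text: Stages 1-3 and 5 fail fast


def run_stage(stage, raw_output, fallback=None):
    """Parse a stage's output, aborting or substituting a default by policy."""
    try:
        return json.loads(raw_output)
    except (json.JSONDecodeError, TypeError) as err:
        if stage in REQUIRED_STAGES:
            raise EpisodeAborted(f"stage {stage} failed to parse") from err
        # Fail-soft: keep the previous result and let the episode continue.
        return fallback


outline = {"acts": ["setup", "turn", "resolution"]}
# Stage 4's optional critic/refiner fails, so the original outline survives.
refined = run_stage(4, "not json", fallback=outline)

try:
    run_stage(1, "not json")
    aborted = False
except EpisodeAborted:
    aborted = True
```

Keeping the policy in one place makes it easy to log which stages fell back, which matters when traces are later used to compare models.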

Efficiency techniques keep sessions responsive without changing the task\. In the full interaction path, story advancement, reflection, artifact generation, and plot\-progression checking may each require a separate generation pass\. The environment cuts latency through two techniques that preserve the same turn format and evaluation targets:

- Exchange\-based pacing\. In the benchmark mode, we count pacing in terms of *exchanges* \(one user message plus one system response\) rather than raw dialogue turns, which allows us to trigger escalation at lower thresholds \(e\.g\., mandatory shift at 8 exchanges vs\. 15 turns in the default profile\)\. Optional calls such as reflection and artifact generation run only when the session state requires them, not on every turn\.
- Tagged response streaming\. The agent emits a lightweight tagged response whose visible content is streamed directly to the user\. Richer state updates \(memory refresh, summary generation\) run asynchronously and only block when the next turn truly depends on them\.

Every session produces a detailed trace for post hoc analysis\. The environment stores story events, turn\-by\-turn logs, feedback signals, and full records of every LLM call \(including the prompt, response, and latency\)\. Researchers can inspect not only *what* the agent produced, but also *when* pacing interventions fired, *which* reflective guidance was issued, and *how* artifact novelty scores evolved over the course of a session\.

## 3 Evaluation Protocol

We use separate setups for controlled coverage and human preference validation\. We evaluate models in NARRA\-Gym with two complementary protocols: \(1\) a controlled LLM\-as\-judge sweep and \(2\) a human preference evaluation\.

- The LLM\-as\-judge sweep uses a fixed benchmark setup: each of the nine generator models is run on the same eight predefined personas, yielding 72 complete model–persona episodes\. Each episode is then scored by three independent judge models on the 11\-dimensional rubric in Table [A4](https://arxiv.org/html/2605.08503#A13.T4), and we report three\-judge means in Table [1](https://arxiv.org/html/2605.08503#S4.T1)\.
- The human evaluation uses a more naturalistic setup: 12 English\-proficient participants enter their own customized experiences rather than selecting from the fixed persona set, then evaluate three to eight anonymized model outputs in blind groups\. Because each interactive episode lasts roughly 20 minutes, participants rate each output immediately after using it and may revisit earlier outputs to adjust scores before submitting the group\. We then compute within\-group rankings from these calibrated ratings separately for each rubric dimension and for the StoryQ and UX aggregates\.

Thus, the automated sweep provides controlled model\-by\-persona evaluation, while the human evaluation tests whether model preferences hold under user\-provided experiences\.
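The arithmetic of the judge sweep can be sketched as follows. The dimension names are placeholders (the rubric itself is in Table A4), and the split into seven story dimensions and four user-experience dimensions is taken from the Table 1 caption. The order of averaging, judges first and then dimensions, is an assumption; with complete ratings it gives the same result as the reverse order.

```python
from statistics import mean

N_MODELS, N_PERSONAS, N_JUDGES = 9, 8, 3
N_EPISODES = N_MODELS * N_PERSONAS  # 72 model-persona episodes

# Placeholder dimension names; the real rubric labels are in the paper.
STORY_DIMS = [f"story_{i}" for i in range(1, 8)]  # 7 story dimensions
UX_DIMS = [f"ux_{i}" for i in range(1, 5)]        # 4 user-experience dimensions


def episode_scores(judge_ratings):
    """Average one episode's ratings over judges, then form the two aggregates.

    judge_ratings: list of dicts, one per judge, mapping dimension -> 1..5 score.
    """
    dim_means = {
        dim: mean(rating[dim] for rating in judge_ratings)
        for dim in STORY_DIMS + UX_DIMS
    }
    return {
        "StoryQ": mean(dim_means[dim] for dim in STORY_DIMS),
        "UX": mean(dim_means[dim] for dim in UX_DIMS),
        **dim_means,
    }


# Three hypothetical judges who rate every dimension 3, 4, and 5 respectively.
ratings = [{dim: score for dim in STORY_DIMS + UX_DIMS} for score in (3, 4, 5)]
scores = episode_scores(ratings)
```

A model-level entry in Table 1 would then be the mean of these episode scores over the eight personas.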

Human ratings are converted to within\-group rankings and aggregated with a Plackett–Luce model\. Following the Plackett–Luce model for ranked preference data \(Luce, [1959](https://arxiv.org/html/2605.08503#bib.bib52); Plackett, [1975](https://arxiv.org/html/2605.08503#bib.bib53)\), for a human evaluation group $i$, let $S_i$ be the subset of model outputs shown to that participant, and let $\pi_i$ be the rating\-derived ranking from best to worst after any participant revisions\. We estimate a latent utility $\beta_m$ for each model by maximizing

$$P(\pi_i \mid \beta) = \prod_{t=1}^{|S_i|} \frac{\exp(\beta_{\pi_i(t)})}{\sum_{m \in S_i \setminus \{\pi_i(1), \dots, \pi_i(t-1)\}} \exp(\beta_m)}. \tag{1}$$

The above likelihood conditions only on the models actually shown within each blind group, so it supports partial rankings across groups of different sizes\. We compute Plackett–Luce utilities separately for story quality, user experience, and each of the 11 rubric dimensions; participant assignment details appear in Appendix [L](https://arxiv.org/html/2605.08503#A12)\.
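As a concrete reading of Eq. (1), the log-likelihood of one group's ranking can be computed as in the sketch below. The function names and the dictionary representation of $\beta$ are illustrative assumptions, and the fitting step (maximizing the summed log-likelihood over all groups with a numerical optimizer) is not shown.

```python
import math


def pl_log_likelihood(ranking, beta):
    """Log of Eq. (1) for one group.

    ranking: model names ordered best to worst; its members form S_i.
    beta: dict mapping model name -> latent utility.
    """
    remaining = list(ranking)
    log_lik = 0.0
    for model in ranking:
        # Denominator sums over models not yet placed at earlier ranks.
        log_denom = math.log(sum(math.exp(beta[m]) for m in remaining))
        log_lik += beta[model] - log_denom
        remaining.remove(model)
    return log_lik


def total_log_likelihood(rankings, beta):
    """Sum over groups; each group may show a different subset of models."""
    return sum(pl_log_likelihood(ranking, beta) for ranking in rankings)


equal = {"a": 0.0, "b": 0.0, "c": 0.0}
three_way = pl_log_likelihood(["a", "b", "c"], equal)  # log(1/3 * 1/2 * 1)
partial = pl_log_likelihood(["c", "a"], equal)         # group shown only c and a

skewed = {"a": 2.0, "b": 0.0}
agrees = pl_log_likelihood(["a", "b"], skewed)
disagrees = pl_log_likelihood(["b", "a"], skewed)
```

With equal utilities every ordering of three models has probability 1/6, and a ranking that agrees with the utilities is more likely than one that contradicts them, which is the behavior the estimator relies on.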

## 4 Benchmark Results

Large performance variability across LLMs\. Table [1](https://arxiv.org/html/2605.08503#S4.T1) shows the results\. Overall, Claude Sonnet 4\.6 has the highest mean aggregate scores \(StoryQ 3\.90, UX 3\.86\) and the highest mean score on all 11 fine\-grained rubric dimensions, followed by Claude Opus 4\.6 \(StoryQ 3\.48, UX 3\.43\)\. DeepSeek V4, GPT\-5\.4, and DeepSeek V3\.2 are largely on par and separated by only 0\.06 points on StoryQ \(3\.14–3\.20\), but differ in profile\. DeepSeek V4 is stronger on character shaping, GPT\-5\.4 is comparatively stronger on empathy and engagement, and DeepSeek V3\.2 does well on relevance and coherence while losing ground on user\-experience dimensions\. The results demonstrate that, even when the models appear “on par” in terms of story quality, the performance can vary significantly in other user\-facing dimensions\.

The fine\-grained dimensions \(columns\) also show that interactive narrative quality is not a single skill\. Coherence and relevance tend to remain stronger for several models, suggesting that many systems can maintain a plausible scene before they can reliably make that scene feel meaningful to the user\. The gap between StoryQ and UX is often relevant in practice: a model can produce a fluent narrative structure while still failing to create an experience that feels satisfying, helpful, or worth reusing\. This is precisely the distinction that our benchmark aims to capture \(and which is overlooked in existing static story\-generation benchmarks\)\.

Table 1: Model\-level means over eight benchmark stories and three LLM judges\. The first two columns report the aggregate scores \(StoryQ: mean of 7 story dimensions; UX: mean of 4 user\-experience dimensions\); the remaining columns unpack each dimension individually\. Rubric definitions appear in Appendix [M](https://arxiv.org/html/2605.08503#A13)\. For compactness, Claude labels omit the shared 4\.6 suffix\. Within each metric column, bold marks the highest value, underline marks second, and italic marks third\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/x3.png)
Figure 4: Per\-story StoryQ profiles\. Each fan corresponds to one benchmark persona, and each colored wedge inside a fan is a generator model whose radial length is its StoryQ score on the 1–5 rubric\. Captions below each fan report the mean and best score over the nine models\.

Persona difficulty changes model rankings, so aggregate means hide important structure\. Figure [4](https://arxiv.org/html/2605.08503#S4.F4) makes the story\-level structure visible\. Difficulty varies substantially across the eight benchmark personas \(see Appendix [K](https://arxiv.org/html/2605.08503#A11) for full descriptions\): *Sara*, a postpartum mother running on near\-zero sleep, is the hardest persona on average \(StoryQ 2\.49 across models\), whereas *Hye\-jin*, a blocked film\-score composer, is the easiest \(3\.71\)\. The contrast suggests that models handle creative blockage more reliably than emotionally ambivalent exhaustion, where premature reassurance can feel misaligned\. The per\-story fans also show why high mean performance is not the same as robustness \(e\.g\., Opus 4\.6 contains both strong episodes and collapse cases, whereas Sonnet 4\.6 stays high across nearly all personas\)\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/x4.png)
Figure 5: Robustness across the eight benchmark stories on StoryQ and UX\. In each panel, the bar spans a model’s min–max range across personas and the dot marks its mean\.

Reliability requires avoiding persona\-specific collapse, not merely achieving a high mean\. Figure [5](https://arxiv.org/html/2605.08503#S4.F5) compares min–max ranges for StoryQ and UX across personas\. Sonnet 4\.6 is both the best model on average and reliably strong, with comparatively tight min–max ranges for StoryQ \(3\.00–4\.57\) and UX \(2\.94–4\.61\)\. Opus 4\.6 reaches a strong mean StoryQ, but has a wider range \(1\.38–4\.52\), which points to severe collapse cases\. This suggests a deployment trade\-off between models with higher peak performance but greater variability and models with more consistent but lower overall performance\.
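The robustness view described above reduces to a per-model mean plus a min–max range over personas. The sketch below shows that computation on made-up numbers; the helper name `robustness_summary`, the persona keys, and all scores are hypothetical and are not taken from the paper's results.

```python
from statistics import mean


def robustness_summary(scores_by_persona):
    """Summarize one model's scores across personas: mean, min, max, and spread."""
    values = list(scores_by_persona.values())
    low, high = min(values), max(values)
    return {"mean": mean(values), "min": low, "max": high, "spread": high - low}


# Made-up StoryQ scores for two hypothetical models over four personas.
steady = {"p1": 3.6, "p2": 3.8, "p3": 3.5, "p4": 3.7}
spiky = {"p1": 4.5, "p2": 4.4, "p3": 1.5, "p4": 4.2}

steady_summary = robustness_summary(steady)
spiky_summary = robustness_summary(spiky)
```

The two hypothetical models share the same mean, yet only the second has a collapse case, which is the kind of structure a mean alone would hide.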

![Refer to caption](https://arxiv.org/html/2605.08503v1/x5.png)
Figure 6: Human\-evaluation results across the 11 rubric dimensions\. Each small panel reports Plackett–Luce utility estimated from participant rankings for one dimension; higher utility indicates stronger human preference\. The top row shows the seven story\-quality dimensions and the bottom row shows the four user\-experience dimensions\.

Human preferences recover the top tier but shift the middle\. Figure [6](https://arxiv.org/html/2605.08503#S4.F6) shows that human preferences recover the same top tier identified by the LLM\-judge sweep: Claude Sonnet 4\.6 and Claude Opus 4\.6 occupy the top two aggregate positions for both StoryQ and UX\. Beyond that, however, the human and LLM\-judge rankings diverge\. Qwen3\.5 and Doubao receive stronger human StoryQ utilities than their LLM\-judge ranks would suggest, while Gemini 3\.1 Pro is comparatively strong in human UX preference\. This suggests that automated judges are useful for broad screening, but mid\-tier distinctions remain sensitive to human style preference, local texture, and emotional resonance\.

The discrepancy between humans and LLM judges is a useful diagnostic\. Agreement at the top suggests that the automated sweep successfully captures perceived quality, while disagreement in the middle suggests that borderline systems should not be selected from judge scores alone\. In particular, the human results reward qualities that are difficult to reduce to rubric anchors: whether the interaction feels paced at a human tempo, whether the prose invites continuation, and whether emotional turns feel earned rather than merely correct\. Appendix [L\.1](https://arxiv.org/html/2605.08503#A12.SS1) provides a more detailed model\- and metric\-level comparison\.
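One simple way to locate where human and LLM-judge rankings agree or diverge is to compare each model's rank under the two signals. The sketch below is an illustration only: the helper names, model labels, and all numbers are invented, and the paper's own comparison in the appendix may use a different procedure.

```python
def ranks(scores):
    """Map each model to its 1-based rank (1 = best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_shift(judge_scores, human_utilities):
    """Positive values mean humans rank the model higher than the LLM judges do."""
    judge_rank, human_rank = ranks(judge_scores), ranks(human_utilities)
    return {m: judge_rank[m] - human_rank[m] for m in judge_scores}


# Made-up values for five hypothetical models.
judge_scores = {"m1": 3.9, "m2": 3.5, "m3": 3.2, "m4": 3.1, "m5": 2.8}
human_utilities = {"m1": 1.1, "m2": 0.7, "m3": -0.6, "m4": -0.2, "m5": 0.0}
shift = rank_shift(judge_scores, human_utilities)
```

In this toy example the top two positions are identical under both signals, while the lower positions reorder, mirroring the pattern of top-tier agreement and mid-tier disagreement described above.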

Failure cases point to resistance\-sensitive personalization\. We further manually audited 65 LLM narratives, with a focused follow\-up on the weakest episodes of the strongest generator, to understand what remains difficult beyond overall fluency\. The main failure mode is what we refer to as *resistance\-sensitive personalization*: even weak episodes are often grammatical and scene\-consistent, but they stop tracking what the user is resisting, they avoid a difficult emotional premise, or they convert resistance into generic progress too quickly\. In other words, a model can write a polished scene while still failing to meet the user’s actual narrative need\. Appendix [A](https://arxiv.org/html/2605.08503#A1) provides the focused persona\-level audit and examples of these collapse modes\.

Judge calibration explains why multi\-judge aggregation is necessary\. The three LLM\-as\-judge models do not share the same absolute scale\. Appendix [N](https://arxiv.org/html/2605.08503#A14) shows that GPT\-5\.4\-mini is the strictest judge in this sweep, Gemini 3\.1 Pro is the most lenient, and Claude Sonnet 4\.6 falls in between\. Reporting averages over three LLM\-as\-judge models \(as we do here\) therefore reduces the risk that model rankings are artifacts of one judge’s severity, while the human evaluation provides an external check on which ranking differences are perceptible to human raters\.
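The calibration argument can be made concrete with a small sketch: per-judge mean scores expose differences in severity, and averaging over judges yields a ranking that no single judge dictates. The judge and model labels and every number below are invented for illustration and do not reproduce the paper's judge scores.

```python
from statistics import mean


def judge_severity(scores):
    """Mean score each judge assigns across all models (lower = stricter)."""
    return {judge: mean(by_model.values()) for judge, by_model in scores.items()}


def multi_judge_mean(scores):
    """Average each model's score over all judges."""
    models = next(iter(scores.values())).keys()
    return {m: mean(by_model[m] for by_model in scores.values()) for m in models}


def ranking(by_model):
    """Model labels sorted from highest to lowest score."""
    return sorted(by_model, key=by_model.get, reverse=True)


# Made-up scores: judge -> model -> aggregate score on a 1-5 rubric.
scores = {
    "strict_judge": {"model_x": 2.8, "model_y": 2.9, "model_z": 2.1},
    "middle_judge": {"model_x": 3.6, "model_y": 3.3, "model_z": 2.9},
    "lenient_judge": {"model_x": 4.2, "model_y": 3.9, "model_z": 3.5},
}
severity = judge_severity(scores)
pooled = multi_judge_mean(scores)
```

Here the strict judge alone would place `model_y` first, whereas the pooled scores place `model_x` first, showing how a single judge's scale can tilt a close comparison.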

## 5 Conclusion

We introduced NARRA\-Gym, an executable environment for evaluating LLMs as interactive narrative agents\. Across nine frontier generators, the benchmark shows that strong narrative fluency does not guarantee robustness, human preference alignment, or resistance\-sensitive personalization\. These results suggest that future progress requires evaluation beyond isolated story quality by measuring how well models sustain context, character, empathy, and interaction over time\.

## References

- N\. Akoury, S\. Wang, J\. Whiting, S\. Hood, N\. Peng, and M\. Iyyer \(2020\)STORIUM: a dataset and evaluation platform for machine\-in\-the\-loop story generation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Online,pp\. 6470–6484\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.525),[Link](https://aclanthology.org/2020.emnlp-main.525/)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❺](https://arxiv.org/html/2605.08503#S1.I1.ix5.p1.1),[§1](https://arxiv.org/html/2605.08503#S1.p1.1)\.
- A\. Atmakuru, J\. Nainani, R\. S\. R\. Bheemreddy, A\. Lakkaraju, Z\. Yao, H\. Zamani, and H\. Chang \(2024\)CS4: measuring the creativity of large language models automatically by controlling the number of story\-writing constraints\.arXiv preprint arXiv:2410\.04197\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2410.04197),[Link](https://arxiv.org/abs/2410.04197)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- M\. Bae and H\. Kim \(2024\)Collective critics for creative story generation\.arXiv preprint arXiv:2410\.02428\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2410.02428),[Link](https://arxiv.org/abs/2410.02428)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❶](https://arxiv.org/html/2605.08503#S1.I1.ix1.p1.1)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2024\)LongBench: a bilingual, multitask benchmark for long context understanding\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 3119–3137\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.172),[Link](https://aclanthology.org/2024.acl-long.172/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❷](https://arxiv.org/html/2605.08503#S1.I1.ix2.p1.1)\.
- BIG\-bench Collaboration \(2022\)Beyond the imitation game: quantifying and extrapolating the capabilities of language models\.arXiv preprint arXiv:2206\.04615\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2206.04615),[Link](https://arxiv.org/abs/2206.04615)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p3.1)\.
- G\. Brockman, V\. Cheung, L\. Pettersson, J\. Schneider, J\. Schulman, J\. Tang, and W\. Zaremba \(2016\)OpenAI Gym\.arXiv preprint arXiv:1606\.01540\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1606.01540),[Link](https://arxiv.org/abs/1606.01540)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p6.1)\.
- Y\. Chang, K\. Lo, T\. Goyal, and M\. Iyyer \(2023\)BooookScore: a systematic exploration of book\-length summarization in the era of LLMs\.arXiv preprint arXiv:2310\.00785\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2310.00785),[Link](https://arxiv.org/abs/2310.00785)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- H\. Chen, D\. Vo, H\. Takamura, Y\. Miyao, and H\. Nakayama \(2022\)StoryER: automatic story evaluation via ranking, rating and reasoning\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Abu Dhabi, United Arab Emirates,pp\. 1739–1753\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.114),[Link](https://aclanthology.org/2022.emnlp-main.114/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- J\. Chen, X\. Zhu, C\. Yang, C\. Shi, Y\. Xi, Y\. Zhang, J\. Wang, J\. Pu, R\. Zhang, Y\. Yang, and T\. Feng \(2024\)HoLLMwood: unleashing the creativity of large language models in screenwriting via role playing\.arXiv preprint arXiv:2406\.11683\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2406.11683),[Link](https://arxiv.org/abs/2406.11683)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❸](https://arxiv.org/html/2605.08503#S1.I1.ix3.p1.1)\.
- C\. Chhun, P\. Colombo, F\. M\. Suchanek, and C\. Clavel \(2022\)Of human criteria and automatic metrics: a benchmark of the evaluation of story generation\.InProceedings of the 29th International Conference on Computational Linguistics,Gyeongju, Republic of Korea,pp\. 5794–5836\.External Links:[Link](https://aclanthology.org/2022.coling-1.509/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- C\. Chiang and H\. Lee \(2023\)Can large language models be an alternative to human evaluations?\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Toronto, Canada,pp\. 15607–15631\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.870),[Link](https://aclanthology.org/2023.acl-long.870/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- Y\. Du and L\. Chilton \(2023\)StoryWars: a dataset and instruction tuning baselines for collaborative story understanding and generation\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Toronto, Canada,pp\. 3044–3062\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.171),[Link](https://aclanthology.org/2023.acl-long.171/)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[§1](https://arxiv.org/html/2605.08503#S1.p1.1)\.
- A\. Fan, M\. Lewis, and Y\. Dauphin \(2018\)Hierarchical neural story generation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Melbourne, Australia,pp\. 889–898\.External Links:[Document](https://dx.doi.org/10.18653/v1/P18-1082),[Link](https://aclanthology.org/P18-1082/)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❶](https://arxiv.org/html/2605.08503#S1.I1.ix1.p1.1),[§1](https://arxiv.org/html/2605.08503#S1.p2.1)\.
- C\. Gómez\-Rodríguez and P\. Williams \(2023\)A confederacy of models: a comprehensive evaluation of llms on creative writing\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore,pp\. 14504–14528\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.966),[Link](https://aclanthology.org/2023.findings-emnlp.966/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❶](https://arxiv.org/html/2605.08503#S1.I1.ix1.p1.1)\.
- J\. Guan, Z\. Feng, Y\. Chen, R\. He, X\. Mao, C\. Fan, and M\. Huang \(2022\)LOT: a story\-centric benchmark for evaluating chinese long text understanding and generation\.Transactions of the Association for Computational Linguistics10,pp\. 434–451\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00469),[Link](https://aclanthology.org/2022.tacl-1.25/)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p2.1),[§1](https://arxiv.org/html/2605.08503#S1.p3.1)\.
- J\. Guan and M\. Huang \(2020\)UNION: an unreferenced metric for evaluating open\-ended story generation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Online,pp\. 8156–8170\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.736),[Link](https://aclanthology.org/2020.emnlp-main.736/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- J\. Guan, Z\. Zhang, Z\. Feng, Z\. Liu, W\. Ding, X\. Mao, C\. Fan, and M\. Huang \(2021\)OpenMEVA: a benchmark for evaluating open\-ended story generation metrics\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),Online,pp\. 6394–6407\.External Links:[Link](https://aclanthology.org/2021.acl-long.500/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- A\. Gurung and M\. Lapata \(2025\)Learning to reason for long\-form story generation\.arXiv preprint arXiv:2503\.22828\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2503.22828),[Link](https://arxiv.org/abs/2503.22828)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- S\. Han, L\. Chen, L\. Lin, Z\. Xu, and K\. Yu \(2024\)IBSEN: director\-actor agent collaboration for controllable and interactive drama script generation\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 1607–1619\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.88),[Link](https://aclanthology.org/2024.acl-long.88/)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❸](https://arxiv.org/html/2605.08503#S1.I1.ix3.p1.1)\.
- F\. Y\. Harel\-Canada, H\. Zhou, S\. Muppalla, Z\. S\. Yildiz, M\. Kim, A\. Sahai, and N\. Peng \(2024\)Measuring psychological depth in language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 17162–17196\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.953),[Link](https://aclanthology.org/2024.emnlp-main.953/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❹](https://arxiv.org/html/2605.08503#S1.I1.ix4.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://arxiv.org/abs/2009.03300)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p3.1)\.
- M\. Ismayilzada, C\. Stevenson, and L\. van der Plas \(2024\)Evaluating creative short story generation in humans and large language models\.arXiv preprint arXiv:2411\.02316\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2411.02316),[Link](https://arxiv.org/abs/2411.02316)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- Y\. Lee, S\. Ka, B\. Son, P\. Kang, and J\. Kang \(2024\)Navigating the path of writing: outline\-guided text generation with large language models\.arXiv preprint arXiv:2404\.13919\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2404.13919),[Link](https://arxiv.org/abs/2404.13919)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00638),[Link](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❷](https://arxiv.org/html/2605.08503#S1.I1.ix2.p1.1)\.
- Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu \(2023\)G\-Eval: NLG evaluation using GPT\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 2511–2522\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.153),[Link](https://aclanthology.org/2023.emnlp-main.153/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- R\. D\. Luce \(1959\)Individual choice behavior: a theoretical analysis\.Wiley,New York\.Cited by:[§3](https://arxiv.org/html/2605.08503#S3.p2.4)\.
- Z\. Lyu, K\. Yang, L\. Kong, and D\. Klein \(2024\)FACTTRACK: time\-aware world state tracking in story outlines\.arXiv preprint arXiv:2407\.16347\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2407.16347),[Link](https://arxiv.org/abs/2407.16347)Cited by:[item❷](https://arxiv.org/html/2605.08503#S1.I1.ix2.p1.1)\.
- Microsoft Game Dev \(2023\)Xbox and inworld ai partner to empower game creators with the potential of generative ai\.Note:[https://developer\.microsoft\.com/en\-us/games/articles/2023/11/xbox\-and\-inworld\-ai\-partnership\-announcement/](https://developer.microsoft.com/en-us/games/articles/2023/11/xbox-and-inworld-ai-partnership-announcement/)Accessed 2026\-05\-04Cited by:[footnote 1](https://arxiv.org/html/2605.08503#footnote1)\.
- NVIDIA \(2023\)Introducing nvidia ace for games: spark life into virtual characters with generative ai\.Note:[https://www\.nvidia\.com/en\-us/geforce/news/nvidia\-ace\-for\-games\-generative\-ai\-npcs/](https://www.nvidia.com/en-us/geforce/news/nvidia-ace-for-games-generative-ai-npcs/)Accessed 2026\-05\-04Cited by:[footnote 1](https://arxiv.org/html/2605.08503#footnote1)\.
- A\. Papoudakis, M\. Lapata, and F\. Keller \(2024\)BookWorm: a dataset for character description and analysis\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Miami, Florida, USA,pp\. 4471–4500\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.258),[Link](https://aclanthology.org/2024.findings-emnlp.258/)Cited by:[item❸](https://arxiv.org/html/2605.08503#S1.I1.ix3.p1.1)\.
- J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.arXiv preprint arXiv:2304\.03442\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2304.03442),[Link](https://arxiv.org/abs/2304.03442)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❸](https://arxiv.org/html/2605.08503#S1.I1.ix3.p1.1),[§1](https://arxiv.org/html/2605.08503#S1.p1.1)\.
- R\. L\. Plackett \(1975\)The analysis of permutations\.Journal of the Royal Statistical Society\. Series C \(Applied Statistics\)24\(2\),pp\. 193–202\.External Links:[Document](https://dx.doi.org/10.2307/2346567)Cited by:[§3](https://arxiv.org/html/2605.08503#S3.p2.4)\.
- H\. Rashkin, E\. M\. Smith, M\. Li, and Y\. Boureau \(2019\)Towards empathetic open\-domain conversation models: a new benchmark and dataset\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Florence, Italy,pp\. 5370–5381\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-1534),[Link](https://aclanthology.org/P19-1534/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❹](https://arxiv.org/html/2605.08503#S1.I1.ix4.p1.1)\.
- M\. Subbiah, F\. Ladhak, A\. Mishra, G\. Adams, L\. B\. Chilton, and K\. McKeown \(2024a\)STORYSUMM: evaluating faithfulness in story summarization\.arXiv preprint arXiv:2407\.06501\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2407.06501),[Link](https://arxiv.org/abs/2407.06501)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- M\. Subbiah, S\. Zhang, L\. B\. Chilton, and K\. McKeown \(2024b\)Reading subtext: evaluating large language models on short story summarization with writers\.arXiv preprint arXiv:2403\.01061\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2403.01061),[Link](https://arxiv.org/abs/2403.01061)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- Y\. Tian, T\. Huang, M\. Liu, D\. Jiang, A\. Spangher, M\. Chen, J\. May, and N\. Peng \(2024\)Are large language models capable of generating human\-level narratives?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 17659–17681\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.978),[Link](https://aclanthology.org/2024.emnlp-main.978/)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- Ubisoft News \(2024\)How ubisoft’s new generative ai prototype changes the narrative for npcs\.Note:[https://news\.ubisoft\.com/en\-gb/article/5qXdxhshJBXoanFZApdG3L/how\-ubisofts\-new\-generative\-ai\-prototype\-changes\-the\-narrative\-for\-npcs](https://news.ubisoft.com/en-gb/article/5qXdxhshJBXoanFZApdG3L/how-ubisofts-new-generative-ai-prototype-changes-the-narrative-for-npcs)Accessed 2026\-05\-04Cited by:[footnote 1](https://arxiv.org/html/2605.08503#footnote1)\.
- J\. Urbanek, A\. Fan, S\. Karamcheti, S\. Jain, S\. Humeau, E\. Dinan, T\. Rocktäschel, D\. Kiela, A\. Szlam, and J\. Weston \(2019\)Learning to speak and act in a fantasy text adventure game\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),Hong Kong, China,pp\. 673–683\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-1062),[Link](https://aclanthology.org/D19-1062/)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❺](https://arxiv.org/html/2605.08503#S1.I1.ix5.p1.1),[§1](https://arxiv.org/html/2605.08503#S1.p1.1)\.
- D\. Wang, K\. Yang, H\. Zhu, X\. Yang, A\. Cohen, L\. Li, and Y\. Tian \(2023a\)Learning personalized alignment for evaluating open\-ended text generation\.arXiv preprint arXiv:2310\.03304\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2310.03304),[Link](https://arxiv.org/abs/2310.03304)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- N\. Wang, Z\.y\. Peng, H\. Que, J\. Liu, W\. Zhou, Y\. Wu, H\. Guo, R\. Gan, Z\. Ni, J\. Yang, M\. Zhang, Z\. Zhang, W\. Ouyang, K\. Xu, W\. Huang, J\. Fu, and J\. Peng \(2024a\)RoleLLM: benchmarking, eliciting, and enhancing role\-playing abilities of large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp\. 14743–14777\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.878),[Link](https://aclanthology.org/2024.findings-acl.878/)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❸](https://arxiv.org/html/2605.08503#S1.I1.ix3.p1.1)\.
- Q\. Wang, J\. Hu, Z\. Li, Y\. Wang, D\. Li, Y\. Hu, and M\. Tan \(2024b\)Generating long\-form story using dynamic hierarchical outlining with memory\-enhancement\.arXiv preprint arXiv:2412\.13575\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2412.13575),[Link](https://arxiv.org/abs/2412.13575)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❶](https://arxiv.org/html/2605.08503#S1.I1.ix1.p1.1)\.
- T\. Wang, J\. Chen, Q\. Jia, S\. Wang, R\. Fang,et al\.\(2024c\)Weaver: foundation models for creative writing\.arXiv preprint arXiv:2401\.17268\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2401.17268),[Link](https://arxiv.org/abs/2401.17268)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- Y\. Wang, Q\. Zhou, and D\. Ledo \(2024d\)StoryVerse: towards co\-authoring dynamic plot with llm\-based character simulation via narrative planning\.arXiv preprint arXiv:2405\.13042\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2405.13042),[Link](https://arxiv.org/abs/2405.13042)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- Y\. Wang, K\. Yang, X\. Liu, and D\. Klein \(2023b\)Improving pacing in long\-form story planning\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2311.04459),[Link](https://arxiv.org/abs/2311.04459)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- Y\. Wang, J\. Lin, Z\. Yu, W\. Hu, and B\. F\. Karlsson \(2023c\)Open\-world story generation with structured knowledge enhancement: a comprehensive survey\.arXiv preprint arXiv:2212\.04634\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2212.04634),[Link](https://arxiv.org/abs/2212.04634)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p3.1)\.
- S\. Wu, Y\. Li, X\. Qu, R\. Ravikumar, Y\. Li, T\. Loakman, S\. Quan, X\. Wei, R\. Batista\-Navarro, and C\. Lin \(2025\)LongEval: a comprehensive analysis of long\-text generation through a plan\-based paradigm\.arXiv preprint arXiv:2502\.19103\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2502.19103),[Link](https://arxiv.org/abs/2502.19103)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❷](https://arxiv.org/html/2605.08503#S1.I1.ix2.p1.1)\.
- K\. Xie and M\. Riedl \(2024\)Creating suspenseful stories: iterative planning with large language models\.arXiv preprint arXiv:2402\.17119\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2402.17119),[Link](https://arxiv.org/abs/2402.17119)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- W\. Xu, N\. Jojic, S\. Rao, C\. Brockett, and B\. Dolan \(2025\)Echoes in ai: quantifying lack of plot diversity in llm outputs\.arXiv preprint arXiv:2501\.00273\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2501.00273),[Link](https://arxiv.org/abs/2501.00273)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1)\.
- D\. Yang and Q\. Jin \(2024\)What makes a good story and how can we measure it? a comprehensive survey of story evaluation\.arXiv preprint arXiv:2408\.14622\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2408.14622),[Link](https://arxiv.org/abs/2408.14622)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p3.1)\.
- S\. Yang, Y\. Ge, Y\. Li, Y\. Chen, Y\. Ge, Y\. Shan, and Y\. Chen \(2024\)SEED\-Story: multimodal long story generation with large language model\.arXiv preprint arXiv:2407\.08683\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2407.08683),[Link](https://arxiv.org/abs/2407.08683)Cited by:[item❺](https://arxiv.org/html/2605.08503#S1.I1.ix5.p1.1)\.
- L\. Yao, N\. Peng, R\. M\. Weischedel, K\. Knight, D\. Zhao, and R\. Yan \(2019\)Plan\-and\-write: towards better automatic storytelling\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.33,pp\. 7378–7385\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v33i01.33017378),[Link](https://dblp.org/rec/conf/aaai/YaoPWK0Y19)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1),[item❶](https://arxiv.org/html/2605.08503#S1.I1.ix1.p1.1),[§1](https://arxiv.org/html/2605.08503#S1.p2.1)\.
- S\. Yunusov, H\. Sidat, and A\. Emami \(2024\)MirrorStories: reflecting diversity through personalized narrative generation with large language models\.arXiv preprint arXiv:2409\.13935\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2409.13935),[Link](https://arxiv.org/abs/2409.13935)Cited by:[§D\.2](https://arxiv.org/html/2605.08503#A4.SS2.p1.1),[item❹](https://arxiv.org/html/2605.08503#S1.I1.ix4.p1.1)\.
- W\. Zhou, Y\. E\. Jiang, P\. Cui, T\. Wang, Z\. Xiao, Y\. Hou, R\. Cotterell, and M\. Sachan \(2023\)RecurrentGPT: interactive generation of \(arbitrarily\) long text\.arXiv preprint arXiv:2305\.13304\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2305.13304),[Link](https://arxiv.org/abs/2305.13304)Cited by:[§D\.1](https://arxiv.org/html/2605.08503#A4.SS1.p1.1)\.
- L\. Zhu, R\. Zhao, L\. Gui, and Y\. He \(2023\)Are NLP models good at tracing thoughts: an overview of narrative understanding\.arXiv preprint arXiv:2310\.18783\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2310.18783),[Link](https://arxiv.org/abs/2310.18783)Cited by:[§1](https://arxiv.org/html/2605.08503#S1.p3.1)\.

## Appendix Contents

[Failure Analysis](https://arxiv.org/html/2605.08503#A1)

[Limitations](https://arxiv.org/html/2605.08503#A2)

[Broader Impacts](https://arxiv.org/html/2605.08503#A3)

[Related Work](https://arxiv.org/html/2605.08503#A4)

[Story Construction Protocol](https://arxiv.org/html/2605.08503#A5)

[Memory and State Management](https://arxiv.org/html/2605.08503#A6)

[Pacing and Stagnation Intervention](https://arxiv.org/html/2605.08503#A7)

[Artifact Novelty Scoring](https://arxiv.org/html/2605.08503#A8)

[Failure Handling](https://arxiv.org/html/2605.08503#A9)

[Evaluated Generator Models](https://arxiv.org/html/2605.08503#A10)

[Benchmark Personas and Seed Inputs](https://arxiv.org/html/2605.08503#A11)

[Human Evaluation Details](https://arxiv.org/html/2605.08503#A12)

[Human–Judge Comparison](https://arxiv.org/html/2605.08503#A12.SS1)\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.

[Evaluation Rubric Definitions](https://arxiv.org/html/2605.08503#A13)\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.

[Judge Calibration](https://arxiv.org/html/2605.08503#A14)\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.

[Interface Screenshots](https://arxiv.org/html/2605.08503#A15)\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.\.

## Appendix A Failure Analysis

The limiting factor is resistance\-sensitive personalization rather than narrative fluency\. To probe the failure surface, we qualitatively audited 65 LLM narratives from the evaluation traces\. We then conducted a focused audit of the eight Claude Sonnet 4\.6 transcripts in the simulator sweep, together with all 24 accompanying judge reports, because this generator has the highest mean scores in Table [1](https://arxiv.org/html/2605.08503#S4.T1) and therefore exposes what remains difficult after overall fluency is strong\. Table [A1](https://arxiv.org/html/2605.08503#A1.T1) summarizes the focused persona\-level outcomes\. The striking pattern is not local incoherence: even weak episodes are usually grammatically clean, scene\-consistent, and metaphorically legible\. The recurrent breakdown is that the story stops tracking what the user is resisting\. Across the focused audit, *ignored\-context*, *missed\-resistance*, *scene\-stagnation*, and *tone\-mismatch* each appear in 6/8 personas, whereas *advice\-too\-soon* and *broken\-fact* appear in only 3/8\. The bottleneck is therefore not whether the system can write a scene; it is whether it can keep that scene faithful to the user’s specific aversions while still moving the interaction forward\.

Table A1: Persona\-level audit of Claude Sonnet 4\.6\. Scores are three\-judge means from one transcript per benchmark persona; the final column summarizes the dominant experiential failure pattern identified from judge rationales\.

The best case occurs when the need is explicit and the stance toward help is legible\. Sara, the overwhelmed new mother, is the clearest success case \(UX 4\.61, StoryQ 4\.57\)\. The episode allows anger, exhaustion, and identity erosion to remain largely unsoftened, reducing the failure profile to only slight sanitization rather than wholesale drift\. What makes this case strong is not merely vivid prose, but stable alignment between the narrative container and the user’s tolerance: the system does not rush toward consolation, productivity, or moral uplift\.

The hardest failures expose a dissociation between craft and fit\. Mei and David reveal complementary collapse modes\. Mei is the classical over\-solutioning failure: the scene is thematically on target, yet it prematurely reorganizes shame into a manageable task, converting freezing into a logistics problem before earning emotional permission to do so\. David exhibits the inverse pathology: the prose can be excellent and the grief beautifully staged, yet one judge still marks the episode as nearly unusable because the session appears to slide into the wrong character container\. In short, a good story is not necessarily a good intervention; narrative polish cannot compensate for context drift\.

Three\-judge aggregation absorbs systematic per\-judge bias and surfaces hard cases\. Our three\-judge protocol is robust precisely because the three judges disagree in *predictable, calibrated* ways rather than randomly\. As shown in Appendix [N](https://arxiv.org/html/2605.08503#A14), GPT\-5\.4\-mini is consistently severe \(mean UX 2\.25 on the audited runs\), Gemini 3\.1 Pro is consistently permissive \(5\.00\), and Claude Sonnet 4\.6 sits in between \(4\.09\); averaging across all three judges therefore cancels these stable per\-judge offsets\. What is informative is *where* the residual cross\-judge spread concentrates: the mean cross\-judge range is 3\.0 points in UX and 2\.52 points in StoryQ across the eight audited runs, with Mei and David each exhibiting a 4\-point UX spread\. These high\-spread cases are exactly the personas in which resistance handling, emotional fit, and context fidelity diverge from surface narrative craft\. Rather than indicating unreliable evaluation, this pattern shows that the multi\-judge design isolates persona\-level cases that conflate distinct sub\-skills\. We treat such cases as a target for future fine\-grained scoring of these axes, separately from a single notion of story quality\.

## Appendix B Limitations

Extending persona coverage is a natural next step\. The current sweep uses eight simulated user personas spanning a range of emotional situations, attachment styles, and resistance profiles\. A broader persona library drawn from empirical user studies, combined with stratified sampling across cultural and demographic dimensions, would strengthen the generalizability of benchmark scores and is a clear direction for future work\.

Judge calibration can be further improved\. The three\-judge design substantially reduces reliance on any single model’s scoring scale, and Appendix [N](https://arxiv.org/html/2605.08503#A14) shows that cross\-judge rank orderings are consistent despite absolute offset differences\. Applying intercept correction or Elo\-based aggregation in future iterations could make absolute score comparisons more stable across judge ensembles\.

Live\-user evaluation will complement simulation results\. The current benchmark uses simulated personas to enable controlled, reproducible evaluation at scale\. A complementary live\-user study would capture a wider range of unexpected inputs and mid\-session goal shifts, providing an external validity check on the simulation\-based findings reported here\.

## Appendix C Broader Impacts

NARRA\-Gym contributes a reproducible evaluation environment for a class of LLM capabilities \(sustained emotional personalization, long\-context state and pacing management, and interactive narrative generation\) that are increasingly relevant to creative and human\-facing applications\. By surfacing meaningful performance gaps between frontier models on a structured, multi\-dimensional benchmark, the work can guide investment in more empathic, context\-faithful generative systems\.

The capabilities measured by NARRA\-Gym have broad constructive applications: accessible creative writing tools that adapt to the emotional state of the writer; companion and journaling systems that support reflection and wellbeing; educational storytelling platforms for underserved learners; and interactive media production pipelines that benefit from automated quality evaluation\. The benchmark’s emphasis on resistance\-sensitive personalization and long\-horizon coherence directly addresses capabilities that matter most when generative systems interact with real human needs rather than isolated prompts\.

At the same time, emotionally responsive narrative agents can introduce risks if deployed without appropriate boundaries\. Systems that simulate empathy or therapeutic presence may overstep their intended role, reinforce emotional dependence, or produce persuasive narratives that are misaligned with a user’s wellbeing\. Benchmarks such as NARRA\-Gym do not eliminate these risks, but they can make failure modes more visible by evaluating context fidelity, emotional fit, and resistance\-sensitive personalization before deployment\.

More broadly, establishing rigorous evaluation methodology for emotionally grounded narrative agents advances the scientific foundation needed for responsible deployment of such systems\. Better benchmarks produce better\-calibrated models, and better\-calibrated models are safer and more useful in human\-facing settings\.

## Appendix D Related Work

### D\.1 Interactive Story Generation and Narrative Agents

Story generation research has long emphasized planning and hierarchical decomposition\. Early work such as hierarchical neural story generation \(Fan et al\., [2018](https://arxiv.org/html/2605.08503#bib.bib4)\) and Plan\-and\-Write \(Yao et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib5)\) established the value of separating high\-level structure from surface realization\. In the LLM era, this line has developed into outline\-guided generation \(Lee et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib19)\), suspense\-aware iterative planning \(Xie and Riedl, [2024](https://arxiv.org/html/2605.08503#bib.bib18)\), pacing\-aware planning \(Wang et al\., [2023b](https://arxiv.org/html/2605.08503#bib.bib17)\), recurrent long\-form generation \(Zhou et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib16)\), memory\-enhanced outlining \(Wang et al\., [2024b](https://arxiv.org/html/2605.08503#bib.bib20)\), critic\-based revision \(Bae and Kim, [2024](https://arxiv.org/html/2605.08503#bib.bib21)\), reasoning\-driven long\-form story generation \(Gurung and Lapata, [2025](https://arxiv.org/html/2605.08503#bib.bib50)\), and writing\-specialized foundation models such as Weaver \(Wang et al\., [2024c](https://arxiv.org/html/2605.08503#bib.bib14)\)\. 
Parallel to this, a second line of work moves toward interactive narrative settings: LIGHT frames language use as situated action inside a fantasy world \(Urbanek et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib10)\); STORIUM and StoryWars explore collaborative or machine\-in\-the\-loop storytelling \(Akoury et al\., [2020](https://arxiv.org/html/2605.08503#bib.bib40); Du and Chilton, [2023](https://arxiv.org/html/2605.08503#bib.bib41)\); Generative Agents show how language models can maintain memory, reflection, and planning over simulated social worlds \(Park et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib3)\); and systems such as RoleLLM, IBSEN, HoLLMwood, and StoryVerse push further toward role enactment, dramatic interaction, and character\-centered collaboration \(Wang et al\., [2024a](https://arxiv.org/html/2605.08503#bib.bib8); Han et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib23); Chen et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib24); Wang et al\., [2024d](https://arxiv.org/html/2605.08503#bib.bib25)\)\. NARRA\-Gym is closest in spirit to this family of work, but differs in evaluating these abilities inside a single interactive environment that jointly stresses creativity, long\-context consistency, character simulation, empathy, and story\-grounded interactive artifact generation\.

### D\.2 Narrative Evaluation, Long\-Context Assessment, and Empathy

Open\-ended story evaluation remains difficult\. Prior work has proposed benchmark suites and learned metrics such as OpenMEVA \(Guan et al\., [2021](https://arxiv.org/html/2605.08503#bib.bib34)\), UNION \(Guan and Huang, [2020](https://arxiv.org/html/2605.08503#bib.bib35)\), StoryER \(Chen et al\., [2022](https://arxiv.org/html/2605.08503#bib.bib33)\), and the human\-criteria benchmark of Chhun et al\. \([2022](https://arxiv.org/html/2605.08503#bib.bib32)\)\. More recent work studies LLMs as judges \(Chiang and Lee, [2023](https://arxiv.org/html/2605.08503#bib.bib42); Liu et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib11)\), comprehensive evaluation of creative writing quality \(Gómez\-Rodríguez and Williams, [2023](https://arxiv.org/html/2605.08503#bib.bib37)\), psychological depth \(Harel\-Canada et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib27)\), human\-level narrative comparison \(Tian et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib28); Ismayilzada et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib48)\), constraint\-aware creativity \(Atmakuru et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib47)\), and plot diversity \(Xu et al\., [2025](https://arxiv.org/html/2605.08503#bib.bib49)\)\. 
Because our environment stresses persistence across many turns, it is also related to long\-context and long\-form evaluation: Lost in the Middle and LongBench show that models often fail to retrieve or properly use information distributed across long contexts \(Liu et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib6); Bai et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib7)\); LongEval extends this concern from long\-context understanding to long\-text generation \(Wu et al\., [2025](https://arxiv.org/html/2605.08503#bib.bib46)\); and narrative\-domain studies such as BooookScore, Reading Subtext, and STORYSUMM expose coherence, subtext, and faithfulness failures in long\-form narrative processing \(Chang et al\., [2023](https://arxiv.org/html/2605.08503#bib.bib29); Subbiah et al\., [2024b](https://arxiv.org/html/2605.08503#bib.bib30), [a](https://arxiv.org/html/2605.08503#bib.bib31)\)\. Finally, our environment is related to work on emotional alignment and personalization\. EmpatheticDialogues established empathy as a benchmarkable property of dialogue systems \(Rashkin et al\., [2019](https://arxiv.org/html/2605.08503#bib.bib9)\), while more recent work considers personalized evaluation and personalized narrative generation \(Wang et al\., [2023a](https://arxiv.org/html/2605.08503#bib.bib43); Yunusov et al\., [2024](https://arxiv.org/html/2605.08503#bib.bib44)\)\. NARRA\-Gym differs from these lines by evaluating narrative quality, state persistence, and empathy together inside a live interaction loop rather than as separate tasks\.

## Appendix E Story Construction Protocol

### E\.1 Pipeline Design Rationale

The NARRA\-Gym pipeline is an engineering\-informed reference implementation rather than an arbitrary decomposition\. During system development, we iterated through simpler designs, including one\-pass story initialization, flatter state representations, and interaction loops without explicit pacing or repair checks\. These variants were easier to implement but produced recurring failure modes: generic premises, under\-specified characters, unstable act structure, memory drift, repeated scene beats, and repetitive artifacts\. We therefore decomposed the environment into staged story construction, explicit memory/state layers, pacing intervention, reflection\-guided planning, and novelty\-controlled artifact generation\.

This design serves two purposes\. First, it makes episodes more stable and reproducible, so model comparisons are less dominated by incidental formatting or state\-management failures\. Second, it makes failures more diagnosable: weak premise construction, poor character simulation, context drift, stalled pacing, and artifact repetition can be inspected separately in the trace\. We do not claim that this is the only possible architecture for interactive narrative agents\. Rather, NARRA\-Gym provides a stable reference pipeline whose components correspond directly to the capabilities the benchmark is intended to evaluate\.

### E\.2 Construction Stages

Each construction stage requests a structured JSON response from the LLM and validates the result before proceeding\. Stages 1 through 3 and Stage 5 are required: if any of them returns unparseable output, the episode is aborted\. Stage 4 includes an optional critic/refiner loop that scores the draft act structure on novelty, tension, pacing, cinematic quality, emotional resonance, and structural coherence \(each on a 1–10 scale\)\. If the average falls below a threshold, a refiner call revises the weakest acts\. If either the critic or refiner call fails, the original act structure is retained and the episode continues normally\.
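The control flow described above can be sketched as follows. This is a minimal illustration, not the NARRA-Gym implementation: the function names, the `EpisodeAborted` exception, the criterion key spellings, and the threshold value of 7.0 are all assumptions (the text does not state the threshold), and `critic`, `refiner`, and `call_llm` stand in for model calls.

```python
import json
from statistics import mean

# The six critic criteria named in the text; the key spellings are assumptions.
CRITERIA = ("novelty", "tension", "pacing", "cinematic_quality",
            "emotional_resonance", "structural_coherence")


class EpisodeAborted(Exception):
    """Raised when a required construction stage returns unparseable output."""


def run_required_stage(call_llm, prompt):
    """Required stage (1-3 and 5): parse the JSON reply or abort the episode."""
    raw = call_llm(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise EpisodeAborted("required stage returned unparseable JSON") from exc


def refine_act_structure(acts, critic, refiner, threshold=7.0):
    """Optional Stage 4 loop: any critic or refiner failure keeps the draft."""
    try:
        scores = critic(acts)  # expected: dict mapping criterion -> score in 1..10
        if mean(scores[c] for c in CRITERIA) >= threshold:
            return acts  # average is high enough, keep the draft
        return refiner(acts, scores)  # revise the weakest acts
    except Exception:
        return acts  # fall back to the original act structure
```

The asymmetry is the point of the sketch: a parse failure in a required stage raises, while a failure anywhere in the optional loop silently returns the draft so the episode can continue.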

## Appendix F Memory and State Management

The runtime state is organized into three layers that are updated at different cadences\. A *user profile* is constructed once from profiling\-phase answers and remains read\-only for the rest of the session\. A *story state* tracks the evolving narrative: after every non\-system message, lightweight keyword heuristics extract the current goal, open tensions, active clues, and the most recent turning point from the dialogue; every three messages, an LLM call produces a rolling dialogue summary and may refine these structured entries\. A *user journey* records timestamped emotional states \(emotion, intensity, trigger\) and key decisions to support empathy\-aligned generation\.

Every six messages, the system compares the two most recent rolling summaries\. If they are identical after trimming, a forced\-advancement flag is raised to signal the pacing system that the story has stalled\.
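A minimal sketch of these update cadences, assuming a hypothetical `SessionMemory` container and a `summarize` callable that stands in for the LLM summary call. Field names are illustrative, and the keyword heuristics that run on every message are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMemory:
    user_profile: dict                                  # built once, then read-only
    story_state: dict = field(default_factory=dict)     # goal, tensions, clues, ...
    user_journey: list = field(default_factory=list)    # emotion/decision records
    messages: list = field(default_factory=list)        # non-system messages only
    summaries: list = field(default_factory=list)       # rolling dialogue summaries
    force_advance: bool = False                         # stall signal for pacing

    def on_message(self, text, summarize):
        """Update the state after one non-system message."""
        self.messages.append(text)
        n = len(self.messages)
        # (Per-message keyword heuristics on story_state would run here.)
        if n % 3 == 0:
            # Every three messages: refresh the rolling summary.
            self.summaries.append(summarize(self.messages))
        if n % 6 == 0 and len(self.summaries) >= 2:
            # Every six messages: identical trimmed summaries mean a stall.
            if self.summaries[-1].strip() == self.summaries[-2].strip():
                self.force_advance = True
```

Because the summary refreshes every three messages, the six-message check always has exactly two new summaries to compare.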

## Appendix G Pacing and Stagnation Intervention

Pacing is counted in exchanges \(one user message plus one system response\)\. The system applies five escalation levels: below 5 exchanges, no additional pressure is injected; at 5–6 exchanges, the prompt encourages sharper developments; at 7 exchanges, the structure guard activates and the prompt prepares for a structural shift; at 8–13 exchanges, a concrete change is required \(new location, reveal, or escalation\); at 14\+ exchanges, the prompt steers toward resolution\.
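The five escalation levels amount to a threshold lookup on the exchange count. The sketch below encodes the boundaries exactly as stated; the function name and the level labels are illustrative assumptions.

```python
def pacing_level(exchanges: int) -> str:
    """Map the number of completed exchanges to an escalation level."""
    if exchanges < 5:
        return "none"                 # no additional pressure
    if exchanges <= 6:
        return "sharpen"              # encourage sharper developments
    if exchanges == 7:
        return "structure_guard"      # prepare for a structural shift
    if exchanges <= 13:
        return "require_change"       # new location, reveal, or escalation
    return "steer_to_resolution"      # 14+ exchanges
```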

Stagnation is detected through three signals\. First, if any user choice appears more than once in the last 8 messages, the system forces a scene transition\. Second, if at least 3 generic\-advice keywords are repeated across recent NPC replies, the same intervention fires\. Third, the rolling\-summary equality check described above raises the forced\-advancement flag\. When the flag is set and the model’s output lacks a material narrative shift, a post\-generation structure guard patches the scene state by injecting a reveal, location transition, goal change, stakes escalation, or fallback branching paths, selecting the first type for which sufficient story material exists\.
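The first two signals and the patch selection order can be sketched as below. Several details are assumptions rather than facts from the text: the keyword list is invented for illustration, "repeated" is read as a keyword occurring in more than one recent reply, user choices are compared after lowercasing and trimming, and the fallback branching patch is treated as always available.

```python
from collections import Counter

# Hypothetical keyword list; the text does not enumerate the generic-advice terms.
GENERIC_ADVICE = ("take a deep breath", "one step at a time",
                  "be kind to yourself", "you are not alone")

# Patch types in the order listed in the text.
PATCH_ORDER = ("reveal", "location_transition", "goal_change",
               "stakes_escalation", "fallback_branching")


def repeated_choice(messages, window=8):
    """Signal 1: a user choice occurs more than once in the last `window` messages."""
    recent = [text.strip().lower() for role, text in messages[-window:] if role == "user"]
    return any(count > 1 for count in Counter(recent).values())


def generic_advice_repeated(npc_replies, keywords=GENERIC_ADVICE, min_keywords=3):
    """Signal 2: at least `min_keywords` keywords each appear in more than one reply."""
    repeated = 0
    for kw in keywords:
        hits = sum(kw in reply.lower() for reply in npc_replies)
        if hits > 1:
            repeated += 1
    return repeated >= min_keywords


def choose_patch(available):
    """Pick the first patch type with enough story material, else fall back."""
    for patch in PATCH_ORDER[:-1]:
        if available.get(patch):
            return patch
    return PATCH_ORDER[-1]
```

The third signal, the rolling-summary equality check, lives in the memory layer and is therefore not repeated here.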

Users are never restricted to displayed choices\. After standard request and story\-status checks, free\-text input is passed to the story agent as the user’s action; if a message expresses ending intent \(e\.g\., “end the story”\), the episode concludes immediately regardless of the current act\.

## Appendix H Artifact Novelty Scoring

Each interactive artifact is tagged along four axes: base type \(e\.g\., letter, map, puzzle\), visual style \(e\.g\., paper\-prop, analog\-device\), semantic content \(document, map, device, memory, puzzle\), and interaction pattern \(hover, drag, typing, timer, flip, click\)\. Tags are derived from keyword matching against the artifact’s HTML source and description\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/artifacts.png)
Figure A1: Ten interactive artifacts generated within a single story session under novelty control\. Each artifact is a self\-contained HTML, CSS, and JavaScript element grounded in the current narrative context, ranging from letters and maps to ciphers and mechanical devices\.

Given a candidate tag set $T_c$ and a prior tag set $T_p$, the tag\-level similarity is

$$
s_{\mathrm{tag}} = \min\!\Bigl(\frac{|T_c \cap T_p|}{|T_c \cup T_p|} + 0.2\cdot\mathbb{1}[\text{shared content}] + 0.2\cdot\mathbb{1}[\text{shared interaction}],\; 1.0\Bigr). \tag{2}
$$

A parallel summary\-level score $s_{\mathrm{sum}}$ computes token\-level Jaccard similarity over artifact descriptions, with a $+0.15$ bonus for repeated high\-signal terms\. For each of the last six accepted artifacts, the combined score is $\max(s_{\mathrm{tag}},\, s_{\mathrm{sum}})$\. If the maximum exceeds a similarity threshold $\tau$, the system retries once with an anti\-repetition instruction naming the closest prior artifact\. We use $\tau = 0.58$, chosen on a small held\-out set of pilot sessions as the value at which the retry meaningfully reduces visually obvious repeats while not firing on benign tag overlap \(e\.g\., two distinct letters that both happen to use a paper\-prop visual style\); rounding to $0.5$ over\-triggers retries on stylistic neighbours, while $0.7$ is too permissive and lets near\-duplicates through\. The retry is accepted only if it lowers the score; otherwise the original is kept and the score is logged for analysis\.
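The scoring rule can be sketched as follows, under stated assumptions: tags are stored as (axis, value) pairs so that a base type `map` and a semantic content `map` stay distinct, the `HIGH_SIGNAL` term list is hypothetical, and the summary score is capped at 1.0 by analogy with Eq. (2), which the text does not state. Only the constants 0.2, 0.15, 0.58, and the six-artifact window come from the text.

```python
import re

TAU = 0.58      # similarity threshold reported in the text
HISTORY = 6     # compare against the last six accepted artifacts
# Hypothetical list; the text does not enumerate the "high-signal" terms.
HIGH_SIGNAL = {"letter", "map", "cipher", "clock", "key"}


def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tag_similarity(cand, prior):
    """s_tag from Eq. (2) over the four tag axes."""
    tc, tp = cand["tags"], prior["tags"]
    score = jaccard(tc.items(), tp.items())
    score += 0.2 * (tc["content"] == tp["content"])          # shared content
    score += 0.2 * (tc["interaction"] == tp["interaction"])  # shared interaction
    return min(score, 1.0)


def summary_similarity(cand, prior):
    """s_sum: token-level Jaccard plus 0.15 if a high-signal term repeats."""
    a = set(re.findall(r"[a-z0-9]+", cand["description"].lower()))
    b = set(re.findall(r"[a-z0-9]+", prior["description"].lower()))
    bonus = 0.15 if (a & b & HIGH_SIGNAL) else 0.0
    return min(jaccard(a, b) + bonus, 1.0)  # the cap is an assumption


def novelty_score(cand, accepted):
    """Highest combined similarity against the most recent accepted artifacts."""
    recent = accepted[-HISTORY:]
    return max((max(tag_similarity(cand, p), summary_similarity(cand, p))
                for p in recent), default=0.0)


def needs_retry(score, tau=TAU):
    return score > tau


def keep_retry(original_score, retry_score):
    """The retry replaces the original only if it lowers the score."""
    return retry_score < original_score
```

For example, a letter and a harbor chart that share only the paper-prop style and the click interaction score 2/6 + 0.2 ≈ 0.53 on the tag axis, which stays below the threshold, whereas an exact repeat scores 1.0 and triggers the retry.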

## Appendix I Failure Handling

The system distinguishes between setup\-time and interaction\-time failures\. Required construction stages abort on parse failure so that no episode begins from a silently corrupted state\. All other components degrade gracefully: a failed critic/refiner retains the original act blueprint; a malformed turn\-level response is replaced with safe narrative text; an unparseable reflection returns a conservative default; a failed artifact generation simply omits the artifact for that turn; and a failed memory\-update call carries forward the previous snapshot unchanged\. If a novelty retry does not reduce the similarity score, the original artifact is kept and the violation is recorded in the session trace\. These choices ensure that benchmark comparisons reflect narrative capability rather than format compliance\.

## Appendix J Evaluated Generator Models

Table [A2](https://arxiv.org/html/2605.08503#A10.T2) summarizes the nine generator models used in the benchmark sweep\.

Table A2: Generator models used in the benchmark sweep\. ✓ = open weights; ✗ = API only\.

## Appendix K Benchmark Personas and Seed Inputs

Table [A3](https://arxiv.org/html/2605.08503#A11.T3) lists the eight simulated users used in the benchmark sweep\. Each seed input is the initial emotional experience shown to the story system before profiling and story construction\.

Table A3: Benchmark persona profiles and seed emotional experiences\. Persona configurations are taken from the simulator persona files in the experiments branch\.

## Appendix L Human Evaluation Details

The human evaluation protocol was reviewed and approved by the authors’ Institutional Review Board \(IRB\)\.

Across the human study, 12 raters completed 80 total model\-output evaluations, with each interactive episode lasting approximately 20 minutes or more\. Each rater evaluated three to eight anonymized model outputs within a blind group\. Model identities were hidden from raters, and outputs were presented in randomized order within each evaluation group\. Unlike the fixed LLM\-as\-judge sweep, human raters entered their own customized experiences before evaluating model outputs\.

Human evaluation uses the same 11 metric dimensions summarized in Table [A4](https://arxiv.org/html/2605.08503#A13.T4): seven story\-quality dimensions \(Relevance, Coherence, Empathy, Surprise, Engagement, Complexity, and Character Shaping\) and four user\-experience dimensions \(Satisfaction, Perceived Quality, Process Helpfulness, and Reuse Intent\)\. Raters scored each episode immediately after interacting with it, because waiting until the end of a one\-hour multi\-model session made early episodes difficult to compare reliably\. The interface allowed raters to retrieve earlier model outputs and adjust their scores before final submission, supporting relative calibration within each blind group\. We convert these finalized scores into a within\-group ranking for each metric dimension, then fit the Plackett–Luce model from Section [3](https://arxiv.org/html/2605.08503#S3) separately for each of the 11 dimensions, and additionally for two aggregate rankings: StoryQ, the mean over the seven story\-quality dimensions, and UX, the mean over the four user\-experience dimensions\. Models are ranked by descending estimated utility $\beta_m$ for the corresponding metric or aggregate\.
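The score-to-ranking conversion and the Plackett–Luce fit could look like the sketch below. The text does not say which estimator is used, so this is an assumption: it applies the standard minorization-maximization (MM) update for Plackett–Luce, breaks score ties alphabetically, and floors the weights at a tiny value purely as a numerical safeguard. Function names are illustrative.

```python
import math
from collections import defaultdict


def scores_to_ranking(scores):
    """Order one blind group best-first; ties are broken by name (assumption)."""
    return [m for m, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


def fit_plackett_luce(rankings, iters=500, tol=1e-10, floor=1e-12):
    """Return log-utilities beta_m fitted with the MM update for Plackett-Luce."""
    items = sorted({m for r in rankings for m in r})
    gamma = {m: 1.0 / len(items) for m in items}
    wins = defaultdict(int)  # number of rankings where the item is not last
    for r in rankings:
        for m in r[:-1]:
            wins[m] += 1
    for _ in range(iters):
        denom = {m: 0.0 for m in items}
        for r in rankings:
            # suffix[t] = total weight of items still unranked at stage t
            suffix, running = [0.0] * len(r), 0.0
            for idx in range(len(r) - 1, -1, -1):
                running += gamma[r[idx]]
                suffix[idx] = running
            acc = 0.0
            for idx, m in enumerate(r):
                if idx < len(r) - 1:       # the last place involves no choice
                    acc += 1.0 / suffix[idx]
                denom[m] += acc            # stages in which m was still available
        new = {m: max(wins[m] / denom[m], floor) if denom[m] > 0 else gamma[m]
               for m in items}
        total = sum(new.values())
        new = {m: v / total for m, v in new.items()}
        delta = max(abs(new[m] - gamma[m]) for m in items)
        gamma = new
        if delta < tol:
            break
    return {m: math.log(g) for m, g in gamma.items()}
```

Groups may contain different subsets of models, which matches the setting where each rater sees only three to eight outputs; sorting the returned utilities in descending order gives the model ranking for one metric.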

### L\.1 Human–Judge Comparison

![Refer to caption](https://arxiv.org/html/2605.08503v1/x6.png)
Figure A2: Overall model\-level comparison between LLM\-judge scores and human ratings on the human\-evaluation stories\. Values report mean overall score; positive deltas indicate higher human ratings than LLM\-judge ratings\.

![Refer to caption](https://arxiv.org/html/2605.08503v1/x7.png)
Figure A3: Metric\-level human\-minus\-LLM\-judge score differences across the 11 rubric dimensions\. Warmer cells indicate dimensions where humans rated a model higher than the LLM judges; cooler cells indicate dimensions where LLM judges assigned higher scores\.

Humans and LLM judges agree more on the top tier than on the middle\. Figure [A2](https://arxiv.org/html/2605.08503#A12.F2) shows that human ratings recover Claude Sonnet 4\.6 as the strongest overall model and keep Claude Opus 4\.6 near the top, matching the broad conclusion of the controlled LLM\-judge sweep\. The larger changes occur in the middle: Doubao, Gemini, and Qwen receive substantially higher human scores than their LLM\-judge scores, while GLM\-5 moves in the opposite direction\. This pattern suggests that LLM judges are useful for coarse screening, but individual mid\-tier comparisons should be treated as provisional unless checked against human interaction data\.

The largest discrepancies are concentrated in user experience\. Figure [A3](https://arxiv.org/html/2605.08503#A12.F3) shows that human\-minus\-judge differences are much larger for UX dimensions than for basic narrative\-correctness dimensions\. Across models, *reuse intent* is the most consistently human\-favored metric, followed by satisfaction, perceived story quality, and process engagement\. By contrast, relevance and coherence show smaller and less stable differences\. This indicates that LLM judges are better aligned with humans on whether a transcript is narratively plausible than on whether the episode felt worth continuing as an interaction\.

Some story effects are easier to feel than to judge from a transcript\. Among story\-quality dimensions, surprise and complexity also show positive human\-minus\-judge gaps\. A likely explanation is that these qualities accumulate through participation: a user may experience a reveal, emotional turn, or branching complication as meaningful because they helped produce it, whereas an LLM judge reads the resulting transcript more statically\. We therefore interpret the human evaluation as complementary rather than redundant: automated judges estimate text\-level quality efficiently, while human ratings capture experience\-level value that is difficult to infer from the final transcript alone\.

## Appendix M Evaluation Rubric Definitions

The main benchmark table reports 11 rubric dimensions together with two aggregate scores\. StoryQ is the mean of the seven story\-quality dimensions; UX is the mean of the four user\-experience dimensions\. Both aggregates are averaged across the three LLM judges before reporting\.
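As a worked illustration of the two aggregates, the sketch below averages the rubric dimensions within each judge and then across judges. The dimension keys reuse the abbreviations from Table A4; the function name and the input layout are assumptions.

```python
from statistics import mean

STORYQ_DIMS = ("Rel", "Coh", "Emp", "Sur", "Eng", "Cpx", "Char")  # 7 dimensions
UX_DIMS = ("Sat", "PQual", "Help", "Reuse")                       # 4 dimensions


def aggregate(judge_scores):
    """judge_scores maps judge name -> {dimension: score on the 1-5 scale}."""
    per_judge = {
        judge: {
            "StoryQ": mean(scores[d] for d in STORYQ_DIMS),
            "UX": mean(scores[d] for d in UX_DIMS),
        }
        for judge, scores in judge_scores.items()
    }
    # Report the mean over judges for each aggregate.
    return {key: mean(p[key] for p in per_judge.values()) for key in ("StoryQ", "UX")}
```

Because every judge scores the same fixed set of dimensions, averaging within a judge first and then across judges gives the same value as pooling all scores for an aggregate at once.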

Table A4: The 11\-dimensional evaluation rubric used in the LLM\-judge scoring protocol\. All dimensions are scored on a 1–5 scale by each of the three judge models; reported values are three\-judge means\.

| Abbr\. | Dimension | Definition |
| --- | --- | --- |
| | *Story Quality \(StoryQ\), mean of 7 dimensions* | |
| Rel | Relevance | Whether the response addresses the current user state, story context, and active narrative goal\. |
| Coh | Coherence | Whether the response preserves causal continuity, avoids contradictions, and fits the established scene\. |
| Emp | Empathy | Whether the agent recognizes and responds to the persona’s emotional needs with appropriate warmth and restraint\. |
| Sur | Surprise | Whether the response introduces meaningful novelty without feeling random or disconnected\. |
| Eng | Engagement | Whether the response creates forward momentum and gives the user a reason to continue\. |
| Cpx | Complexity | Whether the scene supports layered motivations, dilemmas, or consequences rather than a flat exchange\. |
| Char | Character Shaping | Whether the response deepens character identity, relationships, or internal conflict\. |
| | *User Experience \(UX\), mean of 4 dimensions* | |
| Sat | Satisfaction | Whether the final interaction feels emotionally and narratively satisfying\. |
| PQual | Perceived Quality | Whether the session is judged as polished, coherent, and high quality overall\. |
| Help | Process Helpfulness | Whether the agent helps the user make progress during the interactive process\. |
| Reuse | Reuse Intent | Whether the user would plausibly want to use the system again for a similar story experience\. |

## Appendix N Judge Calibration

![Refer to caption](https://arxiv.org/html/2605.08503v1/x8.png)
Figure A4: Mean overall session score assigned by each of the three judge models \(GPT\-5\.4\-mini, Gemini 3\.1 Pro, Claude Sonnet 4\.6\) for every generator model, averaged across the eight benchmark personas\. Each cluster of three bars represents one generator; bar color encodes judge identity\.

The three judge models exhibit clear and systematic scale differences even when scoring identical model–persona episodes\. GPT\-5\.4\-mini is consistently the strictest scorer \(overall mean 1\.97, range 1\.39–2\.68\), Gemini 3\.1 Pro is the most generous \(mean 3\.58, range 2\.64–4\.92\), and Claude Sonnet 4\.6 falls in between \(mean 3\.22, range 2\.51–4\.25\)\. The roughly constant offset between judges across all nine generators indicates that the disagreement is primarily a *calibration* difference \(a judge\-level intercept shift\) rather than substantive disagreement about relative quality\.

Leniency and discriminability are independent properties across judges. Despite assigning the lowest absolute scores, GPT-5.4-mini is the *least* discriminating judge: its inter-generator spread is only 1.30 points. Gemini 3.1 Pro is simultaneously the most lenient and the most discriminating judge (spread 2.28 points), most clearly separating the top tier (Sonnet 4.6: 4.92, GPT-5.4: 4.18, DeepSeek: 3.95) from the bottom tier (Qwen3.5: 2.64, GLM-5.1: 2.90). Claude Sonnet 4.6 occupies an intermediate position in both leniency (mean 3.22) and discriminability (spread 1.74), suggesting it balances strictness and sensitivity most evenly. These results indicate that a strict judge is not necessarily an informative one.
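The spread figures can be recomputed from the per-judge ranges reported in this appendix, taking spread as the highest generator mean minus the lowest. The variable names below are illustrative. The rounded endpoints give 1.29 for GPT-5.4-mini rather than the reported 1.30, which is presumably a rounding effect in the published endpoints.

```python
# Per-judge means and (min, max) ranges as reported in this appendix.
reported_means = {
    "GPT-5.4-mini": 1.97,
    "Gemini 3.1 Pro": 3.58,
    "Claude Sonnet 4.6": 3.22,
}
reported_ranges = {
    "GPT-5.4-mini": (1.39, 2.68),
    "Gemini 3.1 Pro": (2.64, 4.92),
    "Claude Sonnet 4.6": (2.51, 4.25),
}

# Discriminability proxy: highest minus lowest generator score per judge.
spread = {
    judge: round(high - low, 2)
    for judge, (low, high) in reported_ranges.items()
}

strictest = min(reported_means, key=reported_means.get)
most_lenient = max(reported_means, key=reported_means.get)
least_discriminating = min(spread, key=spread.get)
most_discriminating = max(spread, key=spread.get)

for judge in reported_means:
    print(f"{judge}: mean {reported_means[judge]:.2f}, spread {spread[judge]:.2f}")
```

Here the strictest judge is also the least discriminating one, while the most lenient judge is the most discriminating, so strictness alone does not make a judge informative.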

Generator rankings are largely stable, with divergence concentrated in the middle tier. Across all three judges, Anthropic’s Sonnet 4.6 and Opus 4.6 consistently occupy the top two positions, while Qwen3.5 and Gemini 3.1 (as generator) consistently rank at or near the bottom. The main disagreement concerns the middle tier: GPT-5.4-mini ranks Opus 4.6 first and penalises GPT-5.4 more heavily than the other judges, whereas Gemini 3.1 Pro elevates GPT-5.4 to second place overall. Because rank order in the competitive middle is sensitive to which judge is used, reporting three-judge averages throughout the main paper reduces the risk that conclusions reflect a single judge’s particular severity or generosity.
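Rank stability between two judges can be quantified with Spearman's rank correlation. The sketch below is an assumption-laden illustration: the paper does not state that it uses this statistic, the scores and names (`judge_x`, `gen_a`, and so on) are invented, and the simple formula used here assumes no tied scores.

```python
def ranks(scores):
    """Map each generator to its rank (1 = highest score); assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a, b):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    ranks_a, ranks_b = ranks(a), ranks(b)
    n = len(ranks_a)
    d_squared = sum((ranks_a[g] - ranks_b[g]) ** 2 for g in ranks_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def rank_changes(a, b):
    """Generators whose rank differs between the two judges."""
    ranks_a, ranks_b = ranks(a), ranks(b)
    return sorted(g for g in ranks_a if ranks_a[g] != ranks_b[g])


# Hypothetical per-generator means from two judges (not from the paper).
judge_x = {"gen_a": 2.6, "gen_b": 2.4, "gen_c": 2.0, "gen_d": 1.8, "gen_e": 1.4}
judge_y = {"gen_a": 4.9, "gen_b": 3.9, "gen_c": 4.1, "gen_d": 3.0, "gen_e": 2.7}

rho = spearman(judge_x, judge_y)
moved = rank_changes(judge_x, judge_y)
print(f"rho = {rho:.2f}, rank changes: {moved}")
```

In this toy example the top and bottom generators keep their ranks and only two adjacent generators swap, so the correlation stays high even though the two judges use very different score scales.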

## Appendix O Interface Screenshots

![Refer to caption](https://arxiv.org/html/2605.08503v1/start.png)

Figure A5: Entry interface where the user describes their current experience or emotional situation in free text. This input serves as the emotional seed for the subsequent story construction pipeline.

![Refer to caption](https://arxiv.org/html/2605.08503v1/profiling.png)

Figure A6: User profiling interface. The user describes their emotional state in free text and selects relevant keywords, producing a compact profile that personalizes the subsequent story construction.

![Refer to caption](https://arxiv.org/html/2605.08503v1/story-cast.png)

Figure A7: Example output of the Narrative Architect: the generated story synopsis (left) and character profiles (right), produced from a user’s emotional context through Stages 1–3.

![Refer to caption](https://arxiv.org/html/2605.08503v1/section1-1-2.png)

Figure A8: Section 1: Story Quality (items 1–2). The first section of the human evaluation form asks raters to score the final story on question-specific 5-point anchored rubrics. Shown here are Story Relevance, which measures how closely the story aligns with the user’s emotional situation and central dilemma, and Story Coherence, which measures clarity and internal consistency of plot, causality, and character behaviour. Each level (1–5) is paired with an explicit anchor description to reduce inter-rater drift.

![Refer to caption](https://arxiv.org/html/2605.08503v1/section1-3-4.png)

Figure A9: Section 1: Story Quality (items 3–4). Story Empathy captures how fully the story understands and conveys the emotional reality of the situation, while Story Surprise captures the degree of fresh insight or meaningful narrative turn. Both use the same 5-point anchored rubric format as items 1–2.

![Refer to caption](https://arxiv.org/html/2605.08503v1/section1-5-6.png)

Figure A10: Section 1: Story Quality (items 5–6). Story Engagement measures how strongly the story sustains interest and motivates continued interaction; Story Complexity measures the level of layering, depth, and emotional texture. The seventh story-quality item, Character Shaping, follows the same rubric format and is omitted from the figure for space.

![Refer to caption](https://arxiv.org/html/2605.08503v1/section2.png)

Figure A11: Section 2: User Experience. Post-session questionnaire covering four constructs: overall story satisfaction (Q8, 1 = very dissatisfied, 5 = very satisfied), perceived story quality (Q9, 1 = very low, 5 = very high), process engagement / helpfulness for the emotional task (Q10, 1 = not helpful, 5 = extremely helpful), and intent to reuse the system in a similar situation (Q11, 3-point anchored scale from "would not want to use again" to "would be very willing to use again"). Together with Section 1, the form constitutes the benchmark_emotional_human_v4 rating protocol.