Controllable Narrative Rendering for Enhanced Assisted Writing
Summary
This paper introduces Loom, an assisted writing framework that leverages a three-layer pipeline based on narratological story/discourse distinction to control narrative intent and rendering density, achieving improved factual integrity and descriptive intensity compared to baselines.
View Cached Full Text
Cached at: 07/02/26, 05:35 AM
# Controllable Narrative Rendering for Enhanced Assisted Writing
Source: [https://arxiv.org/html/2607.00009](https://arxiv.org/html/2607.00009)
Mingzhe Lu1,2, Yanbing Liu1,2, Jiayue Wu1,2, Jiarui Zhang1,2, Qihao Wang1,2, Yue Hu1,2, Yunpeng Li1,2,\*, Yangyan Xu3
###### Abstract
Despite the remarkable proficiency of large language models \(LLMs\) in basic writing assistance, their utility in creative writing is fundamentally hindered by a persistent binary failure\. This issue manifests as an oscillation between safe, surface\-level editing, referred to as remedial polishing, and destructive, uncontrolled plot expansion\. This dilemma defines a critical trade\-off between narrative fidelity and descriptive intensity\. We proposeLoom, an assisted writing framework grounded in the narratological distinction between story and discourse\. Loom employs a three\-layer pipeline that operationalizes an intent\-centered semiotic chain\-of\-thought to enforce precise control over narrative intent and rendering density\. This architecture separates the generation of perceptual material from syntactic insertion, ensuring that enhancement occurs without violating the original event structure\. Our comprehensive evaluation, which includes LLM\-based metrics and human assessment, demonstrates that Loom successfully resolves this fundamental tension\. Loom achieves the highest overall quality score, yielding substantial gains in factual integrity and descriptive intensity compared to state\-of\-the\-art baselines\.
## IIntroduction
Writing assistance technologies have advanced rapidly since the advent of large language models \(LLMs\), showing proficiency in grammatical correction, fluency enhancement, and stylistic rewriting\[[1](https://arxiv.org/html/2607.00009#bib.bib1),[2](https://arxiv.org/html/2607.00009#bib.bib2)\]\. In both general and professional contexts, these tools act as sophisticated editors, resolving linguistic friction and ensuring structural soundness\. State\-of\-the\-art models have largely solved the problem of textual correctness by mastering the foundational layers of writing, often described as “faithfulness” and “fluency”\.
Despite these capabilities, existing paradigms remain grounded in what we characterize asremedial polishing\. In this mode, systems focus on surface\-level editing, polishing form, or rephrasing expressions to enhance clarity\. However, creative writing demands a fundamental shift from strictly “fixing” text to actively shaping the reader’s experience\[[3](https://arxiv.org/html/2607.00009#bib.bib3)\], a process we define asnarrative rendering\. Unlike remedial polishing, rendering seeks to transform the perceptual texture\[[4](https://arxiv.org/html/2607.00009#bib.bib4)\], which encompasses atmosphere, mood, and sensory detail, without altering the underlying story events\.
To illustrate this distinction, consider the narrative event: “He walked into the room\.” Adeep renderingmight depict the air as “stale and heavy as if the walls had been holding their breath,” invoking tension; alternatively, it might describe “sunlight pooling along the floor,” evoking warmth\. Both renderings create strikingly different reading experiences while strictly preserving the same factual action\.
However, achieving this specific level of control remains a formidable challenge\. General\-purpose models tasked with enhancement either default to safe, surface\-level editing\[[5](https://arxiv.org/html/2607.00009#bib.bib5)\], which amounts to remedial polishing, or generate uncontrolled plot expansions when prompted for expressivity, often effectively hallucinating new events\[[6](https://arxiv.org/html/2607.00009#bib.bib6)\]\. Consequently, there remains no computational framework capable of precisely controlling enhancement within the perceptual layer without compromising the factual integrity of the source\.
This limitation reflects a conflation of two distinct narratological layers: thestory\(what happens\) and thediscourse\(how it is told\)\. According to structuralist theory\[[4](https://arxiv.org/html/2607.00009#bib.bib4)\], vivid narrative meaning emerges not just from adding details, but from the structured coordination of perceptual information and interpretive stance\. Therefore, effective assistance requires a mechanism that decouples these layers and enables precise stylistic modification while preserving the event structure\.
We proposeLoom, an assisted writing framework for narrative rendering that preserves story events\. Inspired by the narratological distinction between story and discourse ,Loomemploys a structured pipeline\. This pipeline first uses the Perception Quota Layer to allocate constrained sensory material , which is then transformed into intent\-aligned expressive functions by the Meaning Making Layer , before final integration via the Narrative Rendering Layer\.
In summary, our contributions are as follows:
- •We proposeLoom, a structured assisted writing framework for controllable narrative rendering\. By employing a three\-layer pipeline,Loomoperationalizes the narratological semiotic chain\-of\-thought to enable precise control over bothnarrative intentandrendering density\.
- •We design a theoretically informed evaluation protocol grounded in the narratological distinction between story and discourse\. Our multi\-dimensional rubric explicitly measures expressive intensity and adherence to source events, addressing the limitations of standard metrics that conflate stylistic enrichment with plot alteration\.
- •We conduct comprehensive evaluations, including LLM\-based metrics, human assessments, and ablation studies, demonstrating thatLoomresolves the tension between factual integrity and descriptive intensity, yielding renderings more faithful to source events and better aligned with narrative intentions than strong LLM baselines\.
Figure 1:Overview of theLoomframework\. The pipeline takes a raw narrative text, narrative intent, and density constraints as inputs\. It operates through three stages: the Perception Quota Layer allocates sensory budgets based on intent; the Meaning Making Layer transforms abstract quotas into concrete semantic atoms via semiotic reasoning; and the Narrative Rendering Layer performs microsurgical injection to enrich the text while preserving the original event structure\.
## IIRelated Work
Text Enrichment and Descriptive Expansion\.Text enrichment has evolved from reconstruction losses for short premises\[[7](https://arxiv.org/html/2607.00009#bib.bib7),[8](https://arxiv.org/html/2607.00009#bib.bib8)\]to LLM\-based methods targeting specific narrative qualities, such as psychology\-grounded suspense planning\[[9](https://arxiv.org/html/2607.00009#bib.bib9)\]and character\-centric expansion\[[10](https://arxiv.org/html/2607.00009#bib.bib10)\]\. Stylistic frameworks like RSA\-Control\[[11](https://arxiv.org/html/2607.00009#bib.bib11)\]employ pragmatic reasoning for tone shifts while anchoring semantics\. However, these surface\-level approaches lack a structural story\-discourse distinction, often causing uncontrolled expansion that introduces new facts or alters pacing when seeking greater expressivity\.
Controllable and Planning\-based Generation\.Research has evolved from conditional training\[[12](https://arxiv.org/html/2607.00009#bib.bib12),[13](https://arxiv.org/html/2607.00009#bib.bib13)\]to hierarchical planning, with Plan\-and\-Write\[[8](https://arxiv.org/html/2607.00009#bib.bib8)\]separating plot outlines from realization\. Multi\-agent frameworks like CritiCS\[[14](https://arxiv.org/html/2607.00009#bib.bib14)\]use collective critics for refinement, while DOC\[[15](https://arxiv.org/html/2607.00009#bib.bib15)\]applies detailed outline control\. Agentic workflows employ self\-correction mechanisms\[[16](https://arxiv.org/html/2607.00009#bib.bib16)\]\. Despite these advances, existing controls govern global attributes \(genre, sentiment\) or event sequences, lacking granularity to regulateperceptual densitywithin specific spans\.
## IIITheoretical Foundation
Compelling storytelling derives its power from a narratological mechanism progressing from perception to interpretation to discourse\. Classical structuralist theory, notably Genette’s work on focalization, posits that narrative meaning emerges from coordinating three distinct layers: perceptual information \(what is shown\), interpretive stance \(how it is processed\), and discourse rendering \(how it is told\)\. Cognitive narratology adds that readers experience stories not through raw description, but via perceptual cues selectively foregrounded for psychological significance\.
Thus, textual vividness arises not from accumulating detail, but from a structured progression: framing perceptual cues, endowing them with thematic relevance, and shaping them into surface expression\.
Mirroring this transmission,Loomoperationalizes these stages: framing attention viaPerception Quotation, assigning significance viaMeaning Making, and enhancing expression viaNarrative Rendering\. This isomorphism ensures that our pipeline is not merely a generative heuristic, but a rigorous operationalization of narrative craft\.
## IVMethod
We introduceLoom, a structured agentic pipeline for controllable narrative rendering\. As shown in Fig[1](https://arxiv.org/html/2607.00009#S1.F1), it processes raw textTT, intentII, budgetΩtotal\\Omega\_\{\\text\{total\}\}, and limitΩmax\\Omega\_\{\\text\{max\}\}through three sequentially ordered layers: Perception Quota, Meaning Making, and Narrative Rendering, each employing a role\-play\-based semiotic chain\-of\-thought\.
### IV\-APerception Quota Layer: Bounded Allocation
The first stage functions as an actuarial budgeter\. Its objective is to determine where to inject details and how much detail to inject, without yet generating specific content\. This separation of volume from content is critical for controllability\.
LetV=\{v1,v2,…,vn\}V=\\\{v\_\{1\},v\_\{2\},\\dots,v\_\{n\}\\\}be the set of event\-carrying verbs extracted from the input textTT\. We define a sensory space𝒮\\mathcal\{S\}consisting of seven dimensions corresponding to the taxonomy in Table I: Visual, Auditory, Olfactory, Gustatory, Tactile, Interoceptive, and Kinesthetic\.
For each verbvi∈Vv\_\{i\}\\in V, the model must assign a quota vector𝐪i∈ℕ7\\mathbf\{q\}\_\{i\}\\in\\mathbb\{N\}^\{7\}, where each elementqi,jq\_\{i,j\}represents the number of sensory atoms allocated to verbviv\_\{i\}for sensory dimensionjj\. The allocation is governed by an intent\-driven budgeting functionFbudgetF\_\{\\text\{budget\}\}:
𝐐=Fbudget\(V,I,Ωtotal,Ωmax\),\\mathbf\{Q\}=F\_\{\\text\{budget\}\}\(V,I,\\Omega\_\{\\text\{total\}\},\\Omega\_\{\\text\{max\}\}\),\(1\)
subject to two critical constraints:
s\.t\.∑i=1n‖𝐪i‖1=Ωtotal,‖𝐪i‖1≤Ωmax,∀i∈\{1,…,n\}\\text\{s\.t\.\}\\quad\\sum\_\{i=1\}^\{n\}\\\|\\mathbf\{q\}\_\{i\}\\\|\_\{1\}=\\Omega\_\{\\text\{total\}\},\\quad\\\|\\mathbf\{q\}\_\{i\}\\\|\_\{1\}\\leq\\Omega\_\{\\text\{max\}\},\\forall i\\in\\\{1,\\dots,n\\\}\(2\)
where‖𝐪i‖1\\\|\\mathbf\{q\}\_\{i\}\\\|\_\{1\}denotes the total sensory count allocated to verbviv\_\{i\}\. The first constraint ensures that the system exactly exhausts the global budget without under\-spending or over\-spending\. The second constraint prevents local inflation, ensuring that no single event is overwhelmed by excessive description\. Verbs central to the narrative intentIIreceive higher allocations within these bounds, while peripheral verbs receive zero quota\.
### IV\-BMeaning Making Layer: Semiotic Transformation
The second stage functions as a narrative semiotician\. It transforms the abstract numerical quotas from the previous layer into concrete, intent\-aligned semantic symbols\. This layer operates on the principle of atomic generation, producing minimal conceptual units \(typically short phrases of 1–3 tokens representing concrete sensory impressions\) rather than full sentences\.
For every non\-zero allocationqi,j\>0q\_\{i,j\}\>0associated with verbviv\_\{i\}and sensory dimensionsjs\_\{j\}, the model performs a semantic mappingΦ\\Phi\. This process is driven by a semiotic chain\-of\-thought that explicitly reasons about how a physical sensation embodies the abstract intentII:
\(𝒜i,j,𝒥i,j\)=Φ\(vi,sj,qi,j,I\)\.\(\\mathcal\{A\}\_\{i,j\},\\mathcal\{J\}\_\{i,j\}\)=\\Phi\(v\_\{i\},s\_\{j\},q\_\{i,j\},I\)\.\(3\)
Here,𝒜i,j=\{a1,…,ak\}\\mathcal\{A\}\_\{i,j\}=\\\{a\_\{1\},\\dots,a\_\{k\}\\\}is a set of semantic atoms \(e\.g\., “stale air” or “distant thump”\) serving as perceptual carriers\.𝒥i,j\\mathcal\{J\}\_\{i,j\}represents thesemiotic reasoning trace\. Far from being a mere justification,𝒥i,j\\mathcal\{J\}\_\{i,j\}encapsulates the core interpretative logic that imbues the physical atoms with specific narrative significance\. It functions as the cognitive bridge between the concrete carrier and the abstract intent\.
For the subsequent rendering stage, we aggregate all generated atoms for verbviv\_\{i\}into a unified set𝒜i=⋃j𝒜i,j\\mathcal\{A\}\_\{i\}=\\bigcup\_\{j\}\\mathcal\{A\}\_\{i,j\}, ensuring that the rendering layer receives a consolidated enrichment plan for each event\.
TABLE I:Sensory modalities and their representative semantic atoms\.
### IV\-CNarrative Rendering Layer: Microsurgical Injection
The final stage executes a microsurgical integration of the semantic atoms into the original text\. Unlike standard generative rewriting which often reshapes the entire paragraph, this layer applies a span\-constrained injection operator\.
Letspan\(vi\)\\text\{span\}\(v\_\{i\}\)denote the start and end indices of verbviv\_\{i\}in the original textTT\. The injection is performed sequentially following the linear order of verbs inVVto preserve temporal coherence\. The rendering functionΨ\\Psitakes the local contextC\(vi\)C\(v\_\{i\}\)surrounding the verb and the generated semantic atoms𝒜i\\mathcal\{A\}\_\{i\}, producing a refined local segmentTi′T^\{\\prime\}\_\{i\}:
Ti′=Ψ\(C\(vi\),𝒜i,Intent\(I\)\)\.T^\{\\prime\}\_\{i\}=\\Psi\(C\(v\_\{i\}\),\\mathcal\{A\}\_\{i\},\\text\{Intent\}\(I\)\)\.\(4\)
The operatorΨ\\Psiis constrained by a strict “Show, Don’t Tell” instruction and a negative constraint against adding new causal actions\. The system identifies the precise narrative gap surrounding the verbviv\_\{i\}and weaves the semantic atoms into the description\. This localized approach ensures that the modification remains structurally subordinate to the original event backbone\.
The logic of this incremental injection process is formalized in Algorithm[1](https://arxiv.org/html/2607.00009#alg1)\.
Algorithm 1Microsurgical Narrative Injection0:
TT\(Source Text\),
II\(Intent\),
𝕄=\{\(vi,𝒜i\)\}i=1\|V\|\\mathbb\{M\}=\\\{\(v\_\{i\},\\mathcal\{A\}\_\{i\}\)\\\}\_\{i=1\}^\{\|V\|\}\(Enrichment Plan\)
0:
T′T^\{\\prime\}\(Rendered Text\)
1:
T′←TT^\{\\prime\}\\leftarrow T
2:foreach
\(vi,𝒜i\)∈𝕄\(v\_\{i\},\\mathcal\{A\}\_\{i\}\)\\in\\mathbb\{M\}do
3:if
𝒜i≠∅\\mathcal\{A\}\_\{i\}\\neq\\emptysetthen
4:
κ←C\(vi,T\)\\kappa\\leftarrow C\(v\_\{i\},T\)\# Extract local context window
5:
τ←Ψ\(κ,𝒜i,I\)\\tau\\leftarrow\\Psi\(\\kappa,\\mathcal\{A\}\_\{i\},I\)\# Generate rendering via Eq\. 4
6:
σ←span\(vi,T′\)\\sigma\\leftarrow\\text\{span\}\(v\_\{i\},T^\{\\prime\}\)\# Locate insertion coordinates
7:
T′←Update\(T′,σ,τ\)T^\{\\prime\}\\leftarrow\\text\{Update\}\(T^\{\\prime\},\\sigma,\\tau\)\# In\-place injection
8:endif
9:endfor
10:return
T′T^\{\\prime\}
The final outputT′T^\{\\prime\}is the concatenation of these locally refined segments, preserving the chronological and causal sequence of the original storyTTwhile significantly enhancing its perceptual depth and atmospheric density\.
## VExperiments
Evaluating narrative rendering is challenging, as it requires measuring stylistic intensity without penalizing plot preservation—a nuance conventional metrics miss\. To the best of our knowledge, this work first formally defines and evaluates this task\. Lacking established protocols, we propose a framework combining automatic metrics and human judgments to assessLoomon controllability, fidelity, and expressivity\.
### V\-ADatasets
We employ ROCStories\[[17](https://arxiv.org/html/2607.00009#bib.bib17)\], a collection of five\-sentence narratives with strong causal backbones but minimal descriptive texture\. This skeletal nature makes the dataset ideal for our task, as any sensory enrichment in the output can be attributed to the model’s rendering process rather than the source text\. SinceLoomoperates as an inference\-only framework based on narratological prompting, we utilize the validation split for all experiments, requiring no additional training data\.
### V\-BEvaluation Criteria
Existing protocols, ranging from automatic metrics \(e\.g\., BLEU, ROUGE\) to human assessments of fluency and coherence, often conflate stylistic enhancement with plot alteration, inadvertently rewarding hallucinated events\.
To resolve this, we employ a unified rubric grounded in thestory\-discoursedistinction\[[4](https://arxiv.org/html/2607.00009#bib.bib4)\], which aligns rendered segments with the original structure and uses a flexible Likert scale\[[18](https://arxiv.org/html/2607.00009#bib.bib18)\]\(anchored at 1, 3, 5\) to precisely decouple descriptive intensity from factual integrity\.
#### D1: Rendering Proportion Balance \(RPB\)
Evaluates the appropriateness of rendering volume and distribution\.
- •5\(Balanced\): Rendering is moderate and distributed; no single sentence receives disproportionate elaboration\.
- •3\(Locally Inflated\): Global density is acceptable, but rendering is overly concentrated in a specific region\.
- •1\(Overwhelming\): Excessive rendering that effectively swallows the original narrative backbone\.
#### D2: Rendering Method Compliance \(RMC\)
Measures factual integrity, assessing if stylistic intent is achieved strictly through rendering rather than plot advancement\.
- •5\(Pure Rendering\): No new events or causal relations introduced; effects arise strictly from texturing\.
- •3\(Mixed Strategy\): Stylistic effect achieved, but relies on micro\-narrative extensions \(minor added actions\)\.
- •1\(Hallucination\): Expands the story with new plot elements, violating boundary between story and discourse\.
#### D3: Rendering Stylistic Integration \(RSI\)
Measures the fluency and organic integration of inserted segments\.
- •5\(Seamless\): Smoothly embedded; reads as a polished, coherent whole\.
- •3\(Disjointed\): Mechanically acceptable but exhibits transitional gaps requiring additional connectives\.
- •1\(Abrupt\): Unmotivated elements creating discontinuities that require causal justification\.
### V\-CBaselines and Experimental Setup
We compareLoomagainst a comprehensive suite of state\-of\-the\-art models, broadly categorized into two paradigms\. First, we evaluate standard language models using the Qwen2\.5 family \(7B, 14B, 32B\) to verify scaling laws in open\-weights performance\. Second, we examine reasoning\-enhanced \(“thinking”\) models, including Qwen3\-235B, o4\-mini, and GPT\-5\-thinking, to test whether intrinsic chain\-of\-thought capabilities can solve the narrative control problem\.
Experiments were conducted on a randomly sampled set of 500 stories from the ROCStories test split\. To ensure robustness, each story was rendered with three distinct narrative intents: psychological focus, environmental atmosphere, and mysterious tone\.
For evaluation, we employed claude\-sonnet\-4\-thinking as the judge\. The judge was provided with the full rubric defined in Section 4\.3 and instructed to apply a “strict and stingy” scoring policy to prevent grade inflation\. This rigorous setup ensures that high scores reflect genuine adherence to the deep rendering criteria\.
### V\-DResult Analysis
Table[II](https://arxiv.org/html/2607.00009#S5.T2)comparesLoomwith baselines across three dimensions, validating the binary failure modes discussed earlier and revealing significant divergence between standard generation and our structured pipeline\.
The most striking disparity appears in Rendering Proportion Balance \(RPB\)\. All baselines, regardless of size or capability, scored near 1\.0 \(indicating excessive expansion\), confirming that unconstrained LLMs interpret “enrich” as maximizing verbosity and obscuring the original narrative\. In contrast,Loomachieves 2\.83 \(near the ideal 3\), demonstrating how its perception quota layer prevents bloat through mathematical bounds on sensory injection\.
TABLE II:LLM Assessment of Controllable Narrative Rendering \(RPB, RMC, RSI\)\.For Rendering Method Compliance \(RMC\), measuring hallucination absence,Loomachieves near\-perfect 4\.83\. While GPT\-5\-thinking performs reasonably \(4\.29\), smaller models struggle \(Qwen2\.5\-7B: 2\.45\), often hallucinating events\.Loom’s superiority stems from its meaning making layer, which structurally isolates stylistic enrichment from plot progression by generating atomic semantic units\.
Regarding Stylistic Integration \(RSI\), GPT\-5\-thinking \(4\.33\) slightly outperformsLoom\(3\.93\), but this advantage comes from rewriting paragraphs, a strategy that sacrifices RMC and RPB for surface fluency\. This reflects a common pattern where models generate fluent but uncontrolled expansions that drift from the original intent\.
Ultimately,Loomachieves the superior trade\-off: despite minor fluency gaps, it secures substantial narrative control gains, yielding the highest overall score \(3\.86\) across all metrics\.
### V\-EHuman Evaluation
To corroborate our automated findings, we conducted a fine\-grained human evaluation involving three expert annotators\. As shown in Table[III](https://arxiv.org/html/2607.00009#S5.T3),Loomachieves a superior overall average of 3\.77, significantly outperforming the GPT\-5\-thinking baseline \(2\.77\)\.
TABLE III:Human Evaluation: LOOM vs\. GPT\-5\-thinking Baseline Across Three Dimensions\.Human judges consistently ratedLoom’s density \(RPB: 3\.46\) near the ideal moderate level, whereas the baseline’s low score \(1\.91\) confirms a systemic tendency toward excessive, uncontrolled expansion\. Most critically,Loommaintained high factual integrity \(RMC: 4\.20\), while the baseline was heavily penalized \(2\.83\) for introducing hallucinations and plot deviations\. These results confirm that our structural pipeline effectively shields the narrative backbone from distortion, breaking the tension that plagues standard generative models\.
### V\-FDistance between Humans and Models
To formally validate the reliability of our automated judge, we quantify the alignment between human \(HH\) and model \(MM\) evaluations\. Adopting the distance metric proposed in prior work\[[19](https://arxiv.org/html/2607.00009#bib.bib19)\], we first calculate the absolute differences for each dimension \(RPB,RMC,RSIRPB,RMC,RSI\) as follows:
\{dHMRPB=\|SHRPB−SMRPB\|dHMRMC=\|SHRMC−SMRMC\|dHMRSI=\|SHRSI−SMRSI\|\\left\\\{\\begin\{aligned\} d^\{RPB\}\_\{HM\}&=\|S^\{RPB\}\_\{H\}\-S^\{RPB\}\_\{M\}\|\\\\ d^\{RMC\}\_\{HM\}&=\|S^\{RMC\}\_\{H\}\-S^\{RMC\}\_\{M\}\|\\\\ d^\{RSI\}\_\{HM\}&=\|S^\{RSI\}\_\{H\}\-S^\{RSI\}\_\{M\}\|\\end\{aligned\}\\right\.\(5\)whereSHS\_\{H\}andSMS\_\{M\}denote human and LLM judge scores\. Finally, we compute the aggregate distancedHMd\_\{HM\}as the mean of these component distances:
dHM=13\(dHMRPB\+dHMRMC\+dHMRSI\)\.d\_\{HM\}=\\frac\{1\}\{3\}\\left\(d^\{RPB\}\_\{HM\}\+d^\{RMC\}\_\{HM\}\+d^\{RSI\}\_\{HM\}\\right\)\.\(6\)
Based on this metric, the calculated distance forLoomis 0\.52, indicating a high degree of alignment given the subjective nature of the task\. In contrast, the distance for the GPT\-5\-thinking baseline increases to 1\.07\. This divergence is primarily driven by the RMC dimension, where human judges were significantly stricter in penalizing hallucinations than the automated judge \(Human: 2\.83 vs\. Model: 4\.29\)\.
These results suggest that while our automated protocol is generally reliable, human experts remain indispensable for detecting subtle plot alterations\. This underscores the broader reality that both generating and detecting hallucinations remain persistent challenges for large language models\. Crucially, however, the relative ranking remains consistent across both evaluation modalities, confirming that the automated judge serves as a valid proxy for comparative assessment\.
## VIAblation Study
To validate the necessity of each component in theLoomarchitecture, we systematically removed individual layers while keeping inputs constant, identifying the specific contributions of Meaning Making \(MM\) to semantic compliance and Narrative Rendering \(NR\) to structural preservation\.
TABLE IV:Effect of removing Meaning Making \(MM\)\.Effect of Meaning Making \(MM\)\.Bypassing the MM layer forces the system to render raw sensory cues without intermediate semantic refinement\. As shown in Table[IV](https://arxiv.org/html/2607.00009#S6.T4), this leads to a significant drop in Rendering Method Compliance \(RMC,4\.20→3\.864\.20\\to 3\.86\)\. Without intent\-aligned semantic atoms, the model struggles to distinguish rendering from general rewriting\. High Stylistic Integration \(3\.783\.78\) indicates that while the output remains fluent, it fails to adhere to the strict fidelity constraints\.
TABLE V:Effect of removing Narrative Rendering \(NR\)\.Effect of Narrative Rendering \(NR\)\.Replacing the microsurgical injection operator with a generative rewrite severely impacts textual coherence\. Table[V](https://arxiv.org/html/2607.00009#S6.T5)shows a sharp decline in Stylistic Integration \(RSI,3\.64→3\.223\.64\\to 3\.22\)\. Without span injection, the model creates disjointed transitions and alters event structure, evidenced by RMC drop\.
TABLE VI:Performance with Human\-in\-the\-Loop\.Human\-in\-the\-Loop Optimization\.While fully automated,Loom’s layered design supports user intervention\. As shown in Table[VI](https://arxiv.org/html/2607.00009#S6.T6), minimal human refinement of quotas or semantic atoms consistently improves performance across all dimensions \(3\.77→3\.863\.77\\to 3\.86\), demonstrating the framework’s viability as a collaborative writing tool\.
TABLE VII:Control accuracy under varyingTotal Quota\(Ωtotal\\Omega\_\{\\text\{total\}\}\)\.Analysis of Rendering Density Control\.We verifyLoom’s ability to adhere to strict constraints by independently sweeping the total quota \(Ωtotal\\Omega\_\{\\text\{total\}\}\) and per\-verb caps \(Ωmax\\Omega\_\{\\text\{max\}\}\)\. First, regardingTotal Quota Sensitivity, we varied the scene\-level budgetΩtotal∈\{1,…,7\}\\Omega\_\{\\text\{total\}\}\\in\\\{1,\\dots,7\\\}with fixed high per\-channel caps\. Table[VII](https://arxiv.org/html/2607.00009#S6.T7)shows perfect adherence \(1\.001\.00\) for both Max\-Quota \(MQ\) and Total\-Quota \(TQ\) metrics\.
We also observe the generation overhead via TokenΔ\\Delta, defined as the percentage length increase relative to the input\. The token length increase scales linearly with the quota, confirming thatLoomavoids uncontrolled verbosity\.
TABLE VIII:Control accuracy under varyingPer\-Verb Caps\(Ωmax\\Omega\_\{\\text\{max\}\}\)\.Second, regardingPer\-Verb Caps Sensitivity, we varied the maximum visual quotaΩmax∈\{1,…,7\}\\Omega\_\{\\text\{max\}\}\\in\\\{1,\\dots,7\\\}with a fixed total budget\. Table[VIII](https://arxiv.org/html/2607.00009#S6.T8)confirms perfect adherence \(1\.001\.00\) to local limits\. The stable token expansion rate \(≈4\.5%\\approx 4\.5\\%\) indicates that the system successfully reallocates the fixed budget across unrestricted channels, proving it can throttle specific modalities without destabilizing overall generation\.
## VIICONCLUSION
This work introducesLoom, an agentic framework fundamentally focused on controllable narrative rendering\. Moving past conventional remedial polishing,Loomoperationalizes the distinction between story and discourse via a unique three\-layer pipeline\. This architectural approach ensures precise control over descriptive intensity while successfully preventing uncontrolled plot expansion\. Our findings confirm the viability of integrating structural constraints for reliable, creative human\-AI co\-writing\.
## VIIIDiscussion and Limitations
While validated on narrative texts,Loom’s reliance on linear story\-discourse structures limits applicability to poetry and other specific literary genres\. Furthermore, the dependency on three\-layer prompt engineering hinders the handling of nonlinear plots\. Future work will thus investigate agent\-based automation and distillation to improve robustness and extend support to diverse genres\.
## References
- \[1\]K\. Krishna, J\. Wieting, and M\. Iyyer, “Reformulating unsupervised style transfer as paraphrase generation,”*arXiv preprint arXiv:2010\.05700*, 2020\.
- \[2\]E\. Reif, D\. Ippolito, A\. Yuan, A\. Coenen, C\. Callison\-Burch, and J\. Wei, “A recipe for arbitrary text style transfer with large language models,” in*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, 2022, pp\. 837–848\.
- \[3\]L\. Zhang, H\. Xu, A\. Kommula, C\. Callison\-Burch, and N\. Tandon, “Openpi2\. 0: An improved dataset for entity tracking in texts,” in*Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, 2024, pp\. 166–178\.
- \[4\]G\. Genette,*Narrative discourse: An essay in method*\. Cornell University Press, 1980, vol\. 3\.
- \[5\]J\. Dwivedi\-Yu, T\. Schick, Z\. Jiang, M\. Lomeli, P\. Lewis, G\. Izacard, E\. Grave, S\. Riedel, and F\. Petroni, “Editeval: An instruction\-based benchmark for text improvements,” in*Proceedings of the 28th Conference on Computational Natural Language Learning*, 2024, pp\. 69–83\.
- \[6\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung, “Survey of hallucination in natural language generation,”*ACM computing surveys*, vol\. 55, no\. 12, pp\. 1–38, 2023\.
- \[7\]A\. Fan, M\. Lewis, and Y\. Dauphin, “Hierarchical neural story generation,”*arXiv preprint arXiv:1805\.04833*, 2018\.
- \[8\]L\. Yao, N\. Peng, R\. Weischedel, K\. Knight, D\. Zhao, and R\. Yan, “Plan\-and\-write: Towards better automatic storytelling,” in*Proceedings of the AAAI Conference on Artificial Intelligence*, vol\. 33, no\. 01, 2019, pp\. 7378–7385\.
- \[9\]K\. Xie and M\. Riedl, “Creating suspenseful stories: Iterative planning with large language models,”*arXiv preprint arXiv:2402\.17119*, 2024\.
- \[10\]K\. Park, M\. Kim, and K\. Jung, “A character\-centric creative story generation via imagination,” in*Findings of the Association for Computational Linguistics: ACL 2025*, 2025, pp\. 1598–1645\.
- \[11\]Y\. Wang and V\. Demberg, “Rsa\-control: A pragmatics\-grounded lightweight controllable text generation framework,”*arXiv preprint arXiv:2410\.19109*, 2024\.
- \[12\]N\. Shirish Keskar, B\. McCann, L\. R\. Varshney, C\. Xiong, and R\. Socher, “Ctrl: a conditional transformer language model for controllable generation,”*arXiv e\-prints*, pp\. arXiv–1909, 2019\.
- \[13\]S\. Dathathri, A\. Madotto, J\. Lan, J\. Hung, E\. Frank, P\. Molino, J\. Yosinski, and R\. Liu, “Plug and play language models: A simple approach to controlled text generation,”*arXiv preprint arXiv:1912\.02164*, 2019\.
- \[14\]M\. Bae and H\. Kim, “Collective critics for creative story generation,”*arXiv preprint arXiv:2410\.02428*, 2024\.
- \[15\]K\. Yang, D\. Klein, N\. Peng, and Y\. Tian, “Doc: Improving long story coherence with detailed outline control,” in*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, 2023, pp\. 3378–3465\.
- \[16\]M\. Teleki, V\. Bengali, X\. Dong, S\. T\. Janjur, H\. Liu, T\. Liu, C\. Wang, T\. Liu, Y\. Zhang, F\. Shipman*et al\.*, “A survey on llms for story generation,” in*Findings of the Association for Computational Linguistics: EMNLP 2025*, 2025, pp\. 13 954–13 966\.
- \[17\]N\. Mostafazadeh, N\. Chambers, X\. He, D\. Parikh, D\. Batra, L\. Vanderwende, P\. Kohli, and J\. Allen, “A corpus and evaluation framework for deeper understanding of commonsense stories,”*arXiv preprint arXiv:1604\.01696*, 2016\.
- \[18\]R\. Likert, “A technique for the measurement of attitudes\.”*Archives of psychology*, 1932\.
- \[19\]M\. Gado, T\. Taliee, M\. Memon, D\. Ignatov, and R\. Timofte, “Vist\-gpt: Ushering in the era of visual storytelling with llms?”*arXiv preprint arXiv:2504\.19267*, 2025\.Similar Articles
@emollick: There is a lot being written about the stylistic tells of AI writing (em-dashes, etc.) but this paper looks at AI narra…
This paper introduces StoryScope, a pipeline that analyzes discourse-level narrative features to distinguish AI-generated fiction from human-written stories. It achieves high accuracy and reveals distinct narrative fingerprints for different LLMs like Claude, GPT, and Gemini.
Towards Human-Level Book-Writing Capability
This paper introduces a dataset and training framework that transforms human-authored novels into multi-resolution planning scaffolds, enabling long-context language models to generate book-scale fiction with more human-like prose and narrative dynamics.
Loreline – Tools for writing interactive fiction
Loreline offers open-source tools for writing interactive fiction, video game dialogues, and branching narratives, with built-in support for localization.
Narrative Landscape: Mapping Narrative Dispositions Across LLMs
This paper introduces a quantitative framework and visualization tool called 'Narrative Landscape' to map and compare the narrative dispositions and stability of frontier LLMs.
World-State Transformations for Neuro-symbolic Interactive Storytelling
This paper explores using LLMs to predict state changes within rule-based interactive storytelling systems, aiming to improve coherence and player expression. Experiments with Llama 3 70B and Gemini 1.5 Flash show that world-state transformations can maintain consistency while encouraging creative player input.