Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents

arXiv cs.CL 06/25/26, 04:00 AM Papers
role-playing memory-architecture character-agents llm knowledge-boundary narrative-generation benchmark
Summary
This paper proposes ReverieMem, a three-layer memory architecture for book-based LLM role-playing agents that prevents factual overreach and stylistic monotony. It also introduces the KBF-QA benchmark and achieves significant improvements in knowledge boundary fidelity and narrative quality.
arXiv:2606.25632v1 Announce Type: new Abstract: Recent LLM role-playing systems build character agents from novels by extracting characters, scenes, and relations. Yet long-narrative role-playing suffers from two failures: Factual Overreach, where shared retrieval or parametric memory lets a character use facts outside its perspective, and Stylistic Monotony, where profile descriptions flatten a character into a fixed voice. To address these failures, we propose REVERIEMEM, a three-layer memory architecture for book-based character agents. The episodic layer stores first-person scene memories; the semantic layer stores visibility-tagged facts; and the personality layer stores situation-dependent speech and behaviour patterns. For evaluation, we construct KBF-QA, a 4,386-question benchmark over eight novels for testing knowledge boundaries. REVERIEMEM improves Knowledge Boundary Fidelity by 34.6 percentage points over the strongest prior method. On BOOKWORLD's five-dimension pairwise narrative protocol, REVERIEMEM achieves a ~ 79% win rate, suggesting that perspective-bounded memory improves both boundary fidelity and character-grounded narrative generation.
# Staying In Character: Perspective-Bounded Memory for Book-Based Role-Playing Agents
Source: [https://arxiv.org/html/2606.25632](https://arxiv.org/html/2606.25632)
Xushuo Tang¹, Junhe Zhang¹, Zihan Yang³, Yifu Tang⁴, Sichao Li², Longbin Lai⁵, Zhengyi Yang²

¹UNSW Sydney, ²University of Sydney, ³Chang’an University, ⁴RAIDS Lab, ⁵Tongyi Lab, Alibaba Group

xushuo\.tang@unsw\.edu\.au, junhe\.zhang1@student\.unsw\.edu\.au, zihany@chd\.edu\.cn, yves\.tang@raids\-lab\.com, sichao\.li@sydney\.edu\.au, zhengyi\.yang@sydney\.edu\.au, Longbin\.lailb@alibaba\-inc\.com

###### Abstract

Recent LLM role\-playing systems build character agents from novels by extracting characters, scenes, and relations\. Yet long\-narrative role playing suffers from two failures: *Factual Overreach*, where shared retrieval or parametric memory lets a character use facts outside its perspective, and *Stylistic Monotony*, where profile descriptions flatten a character into a fixed voice\. To address these failures, we propose ReverieMem, a three\-layer memory architecture for book\-based character agents\. The episodic layer stores first\-person scene memories; the semantic layer stores visibility\-tagged facts; and the personality layer stores situation\-dependent speech and behaviour patterns\. For evaluation, we construct KBF\-QA, a 4,386\-question benchmark over eight novels for testing knowledge boundaries\. ReverieMem improves Knowledge Boundary Fidelity by 34\.6 percentage points over the strongest prior method\. On BookWorld’s five\-dimension pairwise narrative protocol, ReverieMem achieves a $\sim 79\%$ win rate, suggesting that perspective\-bounded memory improves both boundary fidelity and character\-grounded narrative generation\.

## 1 Introduction

> “The divine gift does not come from a higher power, but from our own minds\.” — Dr\. Robert Ford, *Westworld*

LLM role playing agents have become a practical interface for character chat, narrative sandboxes, and interactive story generation \(Shao et al\., [2023](https://arxiv.org/html/2606.25632#bib.bib16); Wang et al\., [2024a](https://arxiv.org/html/2606.25632#bib.bib23); Tu et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib21); Chen et al\., [2024a](https://arxiv.org/html/2606.25632#bib.bib2)\)\. Recent work builds such agents from novels by extracting characters, scenes, and relations as narrative context \(Zhao et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib29); Ran et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib14); Wang et al\., [2025b](https://arxiv.org/html/2606.25632#bib.bib25)\); for example, BookWorld constructs interactive agent societies from books and guides characters with profiles and story context \(Ran et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib14)\)\. A credible character agent must reason within the character’s knowledge boundary and dynamically adapt its voice and behaviour to the evolving narrative situation\. In long novels, this constraint is often broken by two OOC \(out\-of\-character\) failures, illustrated in Figure [1](https://arxiv.org/html/2606.25632#S1.F1): Factual Overreach and Stylistic Monotony\.

![Refer to caption](https://arxiv.org/html/2606.25632v1/x1.png)

Figure 1: Two OOC failures in long\-narrative role playing: Factual Overreach \(a\), where a character claims facts outside its perspective, and Stylistic Monotony \(b\), where a character’s expression is flattened into one fixed profile\.

The first failure, Factual Overreach, occurs when an agent states canonically true facts that are unavailable to the character it is playing\. This can happen when long\-narrative RAG systems retrieve from a shared book\-level memory, or when the LLM’s parametric memory supplies the fact directly \(Gutiérrez et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib7); Wang et al\., [2026](https://arxiv.org/html/2606.25632#bib.bib22)\)\. In Figure [1](https://arxiv.org/html/2606.25632#S1.F1)\(a\), Lestrade should not confirm Holmes’s deduction about the commissionaire’s background at Baker Street: the deduction is correct, but Lestrade was not present\.

The second failure, Stylistic Monotony, occurs when profiles collapse a character into one expressive mode\. In Figure [1](https://arxiv.org/html/2606.25632#S1.F1)\(b\), for example, Holmes is not only a cold analytical reasoner: in Doyle’s first chapter, he greets Watson as an excited experimenter \(“I’ve found it\!”\), then moments later becomes the composed deducer who infers Watson’s military past \(“Afghanistan, I perceive”\)\. A static profile may produce a recognisable Holmes while losing this situational shift\.

Addressing these failures requires memory that is neither a shared book index nor a static profile\. Factual access must be scoped to the character’s narrative position, while expression must be grounded in situated behaviour across scenes\. Three cognitive accounts guide the design: 1\) *Complementary Learning Systems* \(McClelland et al\., [1995](https://arxiv.org/html/2606.25632#bib.bib12)\) models human memory as the cooperation between hippocampal encoding of individual experiences and neocortical consolidation of structured knowledge; 2\) Conway’s *Self\-Memory System* \(Conway and Pleydell\-Pearce, [2000](https://arxiv.org/html/2606.25632#bib.bib4)\) casts autobiographical recall as a top\-down reconstruction that locates event context first and reconstructs details within it; 3\) *Narrative identity* \(McAdams and McLean, [2013](https://arxiv.org/html/2606.25632#bib.bib11)\) holds that identity is constituted by self\-defining episodes rather than static traits\.

We therefore present ReverieMem, a cognitive\-psychology\-inspired three\-layer memory architecture for book\-based role playing\. The *Episodic Layer* stores first\-person scene summaries from each character’s perspective; the *Semantic Layer* stores structured facts with character\-specific visibility; and the *Personality Layer* stores situation\-dependent patterns distilled from canonical speech, behaviour, and emotion transitions\. At inference time, ReverieMem anchors on the character’s own scenes, retrieves only visibility\-allowed facts, and conditions the response on an appropriate behavioural pattern\.

We evaluate ReverieMem on eight novels with two complementary tasks\. For knowledge boundaries, we construct KBF\-QA \(Knowledge Boundary Fidelity\-QA\), a 4,386\-question multiple\-choice benchmark in which a character must answer facts it could know and refuse facts outside its perspective\. For open\-ended narrative quality, we follow the five\-dimension pairwise comparison protocol of BookWorld\. ReverieMem improves KBF by 34\.6 percentage points over the strongest prior method and achieves a $\sim 79\%$ win rate in pairwise narrative comparisons\.

This work makes three contributions:

- To our knowledge, we are the first to formalise and address two OOC failure modes in book\-based role\-playing LLM agents: *Factual Overreach* and *Stylistic Monotony*\.
- We propose ReverieMem, a three\-layer memory architecture that separates scene experience, visibility\-gated factual knowledge, and situation\-dependent expression for book\-based character agents\.
- We construct KBF\-QA, a large\-scale benchmark for Knowledge Boundary Fidelity, and show that ReverieMem surpasses the strongest prior character\-agent system on both KBF\-QA and pairwise narrative comparisons\.

## 2 Related Work

### 2\.1 Role Playing Agents

Role\-playing agents test whether LLMs can sustain consistent characters in dialogue and narrative tasks\. One line of work improves role consistency through persona descriptions, dialogue histories, source\-text grounding, synthetic personas, and role playing training \(Shao et al\., [2023](https://arxiv.org/html/2606.25632#bib.bib16); Wang et al\., [2024a](https://arxiv.org/html/2606.25632#bib.bib23), [2025a](https://arxiv.org/html/2606.25632#bib.bib24)\), with literary settings further grounding agents in book scenes and character internals \(Wang et al\., [2025b](https://arxiv.org/html/2606.25632#bib.bib25)\)\. Benchmarks accompany this line and evaluate multi\-turn dialogue, personality fidelity, and emotional fidelity \(Tu et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib21); Wang et al\., [2024b](https://arxiv.org/html/2606.25632#bib.bib26); Feng et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib6)\), while a recent survey separates fictional role playing from personalization \(Tseng et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib20)\)\. These works mainly test whether responses match the target persona, but provide less evidence on whether the character should know the expressed fact\.

A second line builds story worlds populated by interacting agents through sandbox simulations, multi\-agent frameworks, and novel\-to\-simulation systems \(Park et al\., [2023](https://arxiv.org/html/2606.25632#bib.bib13); Chen et al\., [2024b](https://arxiv.org/html/2606.25632#bib.bib3); Ran et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib14)\), alongside multi\-agent narrative\-generation systems for long\-story writing, screenwriting via role\-played authoring, and autonomous plot progression \(Xia et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib27); Chen et al\., [2024a](https://arxiv.org/html/2606.25632#bib.bib2); Zhao et al\., [2026](https://arxiv.org/html/2606.25632#bib.bib28)\)\. Characters in these systems are typically represented through structured profiles, relation states, or shared narrative context, and evaluation focuses on plot coherence and interaction quality rather than on whether the speaking character can access a given fact in the source narrative\. Whether a character should be allowed to access a given fact remains unaddressed\.

### 2\.2 RAG & Memory for Narrative Reasoning

Memory determines what evidence is available during generation\. Retrieval\-augmented generation grounds outputs in external documents \(Lewis et al\., [2020](https://arxiv.org/html/2606.25632#bib.bib8)\), and adaptive retrieval, hierarchical summarisation, long\-term memory, and narrative memory systems improve evidence selection and multi\-hop or long\-story reasoning \(Asai et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib1); Sarthi et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib15); Wang et al\., [2026](https://arxiv.org/html/2606.25632#bib.bib22)\)\. For role playing over novels, however, the question is not only whether a retrieved fact is relevant but also whether the character can access it\.

A complementary line addresses the boundary of what an agent should commit to\. General LLM knowledge boundaries are catalogued in a recent survey \(Li et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib9)\); for role playing agents specifically, boundary\-aware training \(Tang et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib18)\) and representation\-level refusal editing \(Liu et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib10)\) aim to suppress out\-of\-role answers\. These works define the boundary by what the model knows or what the role constraint forbids; our setting defines it by what the character could have witnessed in the source narrative, making per\-character visibility a core design axis\.

## 3 ReverieMem

![Refer to caption](https://arxiv.org/html/2606.25632v1/x2.png)

Figure 2: Overview of ReverieMem\. Given a query on character $c$, the system runs three phases described in §[3](https://arxiv.org/html/2606.25632#S3): 1\) Source\-to\-Memory extracts per\-scene text, dialogue, and emotion data via an LLM and constructs a perspective\-bounded three\-layer memory for each focus character; 2\) CLS Memory Collaborative Reasoning anchors on the Episodic Layer scene memory $\mathcal{S}_c$, which both grounds and informs Self\-Probe to iteratively expand the fact pool $\mathcal{M}^{(t)}$ from the visibility\-bounded Semantic Layer subset $\mathcal{F}_c$, and finally synthesises a reasoning conclusion; 3\) Fusion Injection selects a pattern $m^{*}$ from the Personality Layer and injects it into the final memory fusion that integrates multiple memory components to produce the response from $c$’s perspective\.

In this section we present ReverieMem, a perspective\-bounded three\-layer memory architecture for character role playing agents that addresses the failures identified in §[1](https://arxiv.org/html/2606.25632#S1)\. ReverieMem therefore serves two design objectives, one for each failure:

- Epistemic Objective\. The agent retrieves only facts the character could plausibly know in the source text, and abstains from claims that no such fact supports\.
- Expressive Objective\. The agent’s prose matches the style the character exhibits in comparable situations in the source text, rather than a flattened trait label\.

ReverieMem instantiates these objectives through three memory layers \(§[3\.2](https://arxiv.org/html/2606.25632#S3.SS2)\) and a perspective\-bounded inference pipeline \(§[3\.4](https://arxiv.org/html/2606.25632#S3.SS4)\)\. Figure [2](https://arxiv.org/html/2606.25632#S3.F2) sketches the architecture\.

### 3\.1 Overview

ReverieMem builds on two cognitive principles: the episodic–semantic distinction \(McClelland et al\., [1995](https://arxiv.org/html/2606.25632#bib.bib12)\) structures its memory layers \(§[3\.2](https://arxiv.org/html/2606.25632#S3.SS2)\), and reconstructive, scene\-anchored recall \(Conway and Pleydell\-Pearce, [2000](https://arxiv.org/html/2606.25632#bib.bib4)\) structures its inference pipeline \(§[3\.4](https://arxiv.org/html/2606.25632#S3.SS4)\)\. We also add a per\-character visibility constraint enforced during memory construction: when speaking as character $c$, retrieval is bounded to the visibility\-allowed subset $\mathcal{F}_c$, and the agent refuses rather than searches harder when $\mathcal{F}_c$ contains no answer\. The full system runs in three phases: an offline construction phase, *Source\-to\-Memory* \(§[3\.3](https://arxiv.org/html/2606.25632#S3.SS3)\), followed at inference by *CLS Memory Collaborative Reasoning* and *Fusion Injection* \(both in §[3\.4](https://arxiv.org/html/2606.25632#S3.SS4)\)\.

### 3\.2 Bounded Three\-Layer Memory

Following the episodic–semantic distinction \(Figure [2](https://arxiv.org/html/2606.25632#S3.F2)\), ReverieMem splits factual memory into an *episodic layer* \(scene memories\) and a *semantic layer* \(discrete facts with per\-character visibility\), and adds a *personality layer* of discrete behavioural patterns derived from each character’s conduct in the source text, selected at inference by emotion transitions \(McAdams and McLean, [2013](https://arxiv.org/html/2606.25632#bib.bib11)\)\.

#### Episodic Layer\.

The episodic layer plays the hippocampus\-like role of fast, scene\-level encoding in CLS: it stores what the character has lived through as a perspectival frame that anchors subsequent retrieval\. For each focus character $c$, the layer maintains a corpus of first\-person scene summaries, one per scene $c$ was present for, recording in $c$’s voice what $c$ did, felt, perceived, and inferred about others\. The corpus is indexed for similarity retrieval and serves as the substrate of the Anchor operation at inference time\. Character\-level scoping ensures every downstream retrieval starts from $c$’s lived experience\.

#### Semantic Layer\.

The semantic layer plays the neocortex\-like role of slow, cross\-episode integration: it consolidates the facts implied across scenes into a visibility\-tagged knowledge graph\. Each fact $f \in \mathcal{F}$ is a five\-element *SPOCV* tuple $f = (s, p, o, \kappa, V)$: $(s, p, o)$ is a standard subject–predicate–object triple; $\kappa$ records the in\-narrative cause when one is explicit, capturing that narrative reasoning hinges on *why* an event occurred, not only *that* it did; and $V \subseteq \mathcal{C}$ is the per\-character visibility set defined below\.

Visibility is granted along four routes \(*direct experience*, *observation*, *organisational sharing*, and *common knowledge*\), with assignment details deferred to §[3\.3](https://arxiv.org/html/2606.25632#S3.SS3)\. For each character $c$, the visibility\-allowed subset $\mathcal{F}_c = \{f \in \mathcal{F} : c \in V(f)\}$ bounds retrieval when ReverieMem speaks as $c$: facts outside $\mathcal{F}_c$ cannot be surfaced by any amount of retrieval relevance\.
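As an illustration of the SPOCV tuple and the visibility\-allowed subset, the following is a minimal Python sketch, not the authors' implementation\. The class name `Fact`, the helper `visible_subset`, and the two example facts are hypothetical and chosen only to mirror the Figure 1 scenario\.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Fact:
    """One SPOCV tuple f = (s, p, o, kappa, V)."""
    subject: str
    predicate: str
    obj: str
    cause: Optional[str]        # kappa: in-narrative cause, None when not explicit
    visible_to: FrozenSet[str]  # V: characters allowed to see this fact


def visible_subset(facts: Iterable[Fact], character: str) -> List[Fact]:
    """Return F_c = {f in F : c in V(f)}."""
    return [f for f in facts if character in f.visible_to]


# Hypothetical facts, loosely modelled on the Figure 1 example.
facts = [
    Fact("Holmes", "deduces_background_of", "commissionaire", None,
         frozenset({"Holmes", "Watson"})),
    Fact("Lestrade", "works_at", "Scotland Yard", None,
         frozenset({"Holmes", "Watson", "Lestrade"})),
]

lestrade_view = visible_subset(facts, "Lestrade")
print([f.predicate for f in lestrade_view])  # ['works_at']
```

Because the filter runs before any similarity search, a fact outside the subset cannot be retrieved however relevant it is to the query\.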

#### Personality Layer\.

The personality layer addresses stylistic monotony: a character’s behavioural style varies with situation, distinct in confrontation, in deference, and in retreat\. We therefore represent each character not as a static profile of trait adjectives but as a finite set of discrete patterns abstracted from the character’s conduct in the source text\.

For each character $c$, the layer maintains two artefacts\. The first is a set $\{m_1, \ldots, m_K\}$ of *personality patterns*, each pairing a short description with canonical excerpts from the source text \(e\.g\. “Lestrade in cautious deference to Holmes”\); the descriptions support selection and the excerpts serve as in\-context style anchors\. The second is a record of $c$’s *emotion transitions*, derived from per\-utterance dialogue annotations: each transition encodes an $(\text{emotion}, \text{intensity}, \text{trigger})$ triple together with the speaker’s inferred intent\. At inference time, the emotion record situates next\-state prediction within $c$’s observed trajectory, and the selected pattern provides the behavioural register that conditions memory fusion\. Construction is described in §[3\.3](https://arxiv.org/html/2606.25632#S3.SS3)\.

### 3\.3 Source\-to\-Memory

The three memory layers are constructed offline from the source novel by a scene\-anchored procedure \(Figure[2](https://arxiv.org/html/2606.25632#S3.F2), left\)\. The only human input is the focus\-character list with canonical names and aliases; the remaining annotations are elicited from a language model\.

#### Scene preparation\.

The novel is partitioned into scenes by an LLM\-guided procedure that merges contiguous text spans falling within a single dramatic unit\. Each scene is augmented with situational metadata \(location, time, atmosphere\) and a focus\-character roster that records, for each focus character, whether they are present and active, present but silent, or only referenced\. This scene\-level decomposition is the shared substrate of the three layers\.

#### Episodic summarisation\.

For each focus character $c$ and each scene in which $c$ is at least present, we elicit a first\-person retrospective summary in $c$’s voice\. Summaries are constructed scene by scene, without access to subsequent narrative material, so that each summary reflects only the character’s local experience at that point\. The resulting per\-character corpora are indexed by dense embedding for similarity\-based retrieval at inference time\.

#### Knowledge graph extraction\.

Following the two\-stage extract\-then\-relate paradigm common in narrative\-domain knowledge graphs \(Gutiérrez et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib7); Wang et al\., [2026](https://arxiv.org/html/2606.25632#bib.bib22)\), we first elicit per\-scene entities spanning seven types, then induce $(s, p, o, \kappa)$ relations over them, with $\kappa$ recording the in\-narrative cause when explicit\. Entity aliases are unified across scenes, after which facts are deduplicated against the canonical entity table\. The visibility component $V$ is then assigned in two phases, completing the SPOCV tuple: a local phase grants direct experience to the fact’s subject and object, observation to characters witnessing the scene of the fact, and common knowledge for world\-level background; a propagation phase extends organisational sharing by tracing each fact through a character–organisation membership graph derived from the same scenes\.
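The two\-phase visibility assignment can be sketched as follows\. This is an illustrative reading rather than the paper's procedure: the function names, the one\-hop propagation rule \(a fact seen by any member of an organisation is shared with all of its members\), and the example cast are assumptions, since the text does not specify how facts are traced through the membership graph\.

```python
from typing import Dict, Iterable, Set


def local_visibility(subject: str, obj: str, witnesses: Iterable[str],
                     characters: Set[str], common_knowledge: bool) -> Set[str]:
    """Local phase: direct experience, observation, and common knowledge."""
    if common_knowledge:
        return set(characters)  # world-level background is visible to everyone
    # Direct experience: the fact's subject and object, when they are characters.
    visible = {name for name in (subject, obj) if name in characters}
    # Observation: characters who witnessed the scene containing the fact.
    visible.update(w for w in witnesses if w in characters)
    return visible


def propagate_by_organisation(visible: Set[str],
                              membership: Dict[str, Set[str]]) -> Set[str]:
    """Propagation phase (assumed one hop): share with co-members."""
    extended = set(visible)
    for members in membership.values():
        if members & visible:  # some member already sees the fact
            extended |= members
    return extended


# Hypothetical cast and membership graph.
characters = {"Holmes", "Watson", "Lestrade", "Gregson"}
membership = {"Scotland Yard": {"Lestrade", "Gregson"}}

local = local_visibility("Lestrade", "telegram", [], characters, False)
final = propagate_by_organisation(local, membership)
print(sorted(local), sorted(final))  # ['Lestrade'] ['Gregson', 'Lestrade']
```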

#### Personality clustering\.

We extract the dialogues attributed to each focus character on a per\-scene basis and normalise speaker labels to canonical names\. Each utterance is annotated with an $(\text{emotion}, \text{intensity}, \text{trigger})$ triple together with the speaker’s inferred intent; concatenated chronologically, these annotations form the character’s emotion\-transition record\. We then cluster the character’s scenes by behavioural similarity into $K$ personality patterns\. For each pattern, we distil a one\-line description \(embedded for retrieval\) and select canonical excerpts from the source dialogues that exemplify the pattern\.

Per\-layer artefacts are shown in Appendix [B](https://arxiv.org/html/2606.25632#A2), with per\-book layer sizes in Appendix [A](https://arxiv.org/html/2606.25632#A1)\.

### 3\.4 Inference Pipeline

ReverieMem’s inference pipeline realises a perspective\-bounded form of reconstructive recall\. For a query $q$ directed at character $c$, two phases run in coordination, with every retrieval restricted to the visibility\-allowed subset $\mathcal{F}_c$: *CLS Memory Collaborative Reasoning* retrieves a scene\-level frame from $c$’s episodic memory and iteratively fills it with visibility\-bounded facts; *Fusion Injection* closes inference with a memory fusion that incorporates a pattern $m^{*}$ supplied by the parallel personality track \(Figure [2](https://arxiv.org/html/2606.25632#S3.F2), middle and right\)\.

#### CLS Memory Collaborative Reasoning\.

The main reasoning chain interleaves two stages over $\mathcal{F}_c$\.

Scene recall\. The Anchor operation queries the episodic\-layer corpus for $c$ by similarity to $q$ and returns the matched first\-person scene summaries as the scene memory:

$$\mathcal{S}_c = \textsc{Anchor}(q,\ c). \tag{1}$$

This is not the response to $q$ but the frame within which the next stage reconstructs detail, following the Self\-Memory System’s hierarchical retrieval cascade \(Conway and Pleydell\-Pearce, [2000](https://arxiv.org/html/2606.25632#bib.bib4)\)\. Mainstream narrative RAG queries a novel\-level index directly, surfacing evidence outside the speaker’s perspective\. Anchoring on $c$’s own scenes first restricts consideration to $c$’s lived experience\.

Scene\-guided detail reconstruction\. The scene memory $\mathcal{S}_c$ guides which factual details to recall, rather than treating $\mathcal{F}_c$ as a flat search space\. Given $q$ and $\mathcal{S}_c$, Self\-Probe generates a set of focused sub\-queries $\mathcal{P}^{(1)}$ targeting the factual gaps in the scene memory, and the candidate\-fact pool starts empty: $\mathcal{M}^{(0)} = \emptyset$\. The reconstruction then loops over rounds $t = 1, \ldots, N$: each round issues the current probes against $\mathcal{F}_c$ via dense retrieval, expanding the pool

$$\mathcal{M}^{(t)} = \mathcal{M}^{(t-1)} \cup \textsc{Vis-Retrieve}\bigl(\mathcal{P}^{(t)},\ \mathcal{F}_c\bigr). \tag{2}$$

Sufficient then checks whether $(\mathcal{S}_c, \mathcal{M}^{(t)})$ already supports a response from $c$’s perspective; if so \(or if $t = N$\), the loop exits at round $\tau$\. Otherwise Self\-Probe generates follow\-up probes targeting the residual gap, and the loop continues\. The terminal pool $\mathcal{M}^{(\tau)}$ is handed to the memory\-fusion stage\.

This retrieve\-evaluate\-deepen schedule resembles existing narrative\-domain RAG; what is specific to a character agent is that detail reconstruction is anchored on the speaker’s own scene memory rather than a novel\-level summary, and that every retrieval is bounded to $\mathcal{F}_c$: facts $c$ could not have witnessed are never admissible into the pool\.
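The control flow of the reconstruction loop in Eq\. \(2\) can be sketched in Python as below\. This is a hedged illustration only: in the paper Self\-Probe and Sufficient are LLM\-driven and Vis\-Retrieve uses dense retrieval, whereas the `toy_*` functions here are hypothetical keyword stand\-ins so that the loop can run on its own\.

```python
from typing import Callable, List, Set, Tuple


def reconstruct_details(
    query: str,
    scene_memory: List[str],
    visible_facts: List[str],  # F_c, already filtered by visibility
    self_probe: Callable[[str, List[str], Set[str]], List[str]],
    vis_retrieve: Callable[[List[str], List[str]], Set[str]],
    sufficient: Callable[[str, List[str], Set[str]], bool],
    max_rounds: int,
) -> Tuple[Set[str], int]:
    """Run the retrieve-evaluate-deepen loop; return (M^(tau), tau)."""
    pool: Set[str] = set()                          # M^(0) is empty
    probes = self_probe(query, scene_memory, pool)  # P^(1)
    tau = 0
    for t in range(1, max_rounds + 1):
        tau = t
        pool |= vis_retrieve(probes, visible_facts)     # Eq. (2)
        if sufficient(query, scene_memory, pool):
            break
        probes = self_probe(query, scene_memory, pool)  # follow-up probes
    return pool, tau


# Toy stand-ins (assumptions): keyword probes and substring retrieval.
def toy_probe(query, scenes, pool):
    covered = " ".join(pool).lower()
    return [w for w in query.lower().split() if w not in covered]


def toy_retrieve(probes, facts):
    return {f for f in facts if any(p in f.lower() for p in probes)}


def toy_sufficient(query, scenes, pool):
    return len(pool) >= 1


lestrade_facts = ["Lestrade works at Scotland Yard"]
pool, tau = reconstruct_details("where does lestrade work", [], lestrade_facts,
                                toy_probe, toy_retrieve, toy_sufficient, 3)
empty, tau_max = reconstruct_details("commissionaire background", [], lestrade_facts,
                                     toy_probe, toy_retrieve, toy_sufficient, 3)
print(pool, tau)       # {'Lestrade works at Scotland Yard'} 1
print(empty, tau_max)  # set() 3
```

In the second call the loop exhausts its round budget with an empty pool, which is the situation where the agent is expected to abstain in character rather than search outside the visibility\-allowed subset\.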

#### Fusion Injection\.

The personality track runs in parallel with the reasoning chain and supplies a pattern $m^{*}$ to the closing memory fusion through three operations: Emo\-Gate, Emo\-Transition, and Pattern\.

Pattern selection \(parallel\)\. Emo\-Gate judges whether the current dialogue turn carries a salient emotional event; routine exchanges leave the emotional state in place\. When the gate fires, Emo\-Transition reads $c$’s emotion\-transition record and proposes a new state, and Pattern selects $m^{*}$ from $\{m_1, \ldots, m_K\}$ by description\-embedding match to the new state and dialogue context\.
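A minimal sketch of gated pattern selection by embedding match follows\. It assumes cosine similarity as the matching score and assumes that a routine turn keeps the currently active pattern; neither detail is stated in the text\. The function names and the two\-dimensional vectors are hypothetical placeholders for real description embeddings\.

```python
import math
from typing import Dict, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_pattern(gate_fired: bool,
                   current: Optional[str],
                   context_vec: Sequence[float],
                   description_vecs: Dict[str, Sequence[float]]) -> Optional[str]:
    """Pick m* by description-embedding match, only when the gate fires."""
    if not gate_fired:
        return current  # assumption: a routine turn keeps the active pattern
    return max(description_vecs,
               key=lambda name: cosine(context_vec, description_vecs[name]))


# Hypothetical 2-d embeddings standing in for real description embeddings.
holmes_patterns = {
    "excited experimenter": [0.9, 0.1],
    "composed deducer": [0.1, 0.9],
}
chosen = select_pattern(True, "excited experimenter", [0.2, 0.8], holmes_patterns)
print(chosen)  # composed deducer
```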

Perspective\-bounded memory fusion\. The closing memory\-fusion stage operates over the terminal workspace $(\mathcal{S}_c, \mathcal{M}^{(\tau)})$, conditioned on the injected pattern $m^{*}$\. The selected pattern’s canonical excerpts $S_{m^{*}}$ and description $d_{m^{*}}$ are concatenated into the memory\-fusion prompt as in\-context style anchors, and Memory\-Fusion produces a response from $c$’s perspective:

$$r = \textsc{Memory-Fusion}(q,\ c,\ \mathcal{S}_c,\ \mathcal{M}^{(\tau)};\ m^{*}). \tag{3}$$

Because the workspace is bounded by $\mathcal{F}_c$ throughout, the only claims the fusion can ground are claims $c$ could plausibly know; when $\mathcal{M}^{(\tau)}$ contains no relevant material, $c$ acknowledges the absence in character\. The memory\-fusion stage and the personality layer thus jointly determine *whether* and *how* the agent conducts itself\. Algorithm [1](https://arxiv.org/html/2606.25632#alg1) \(Appendix [C](https://arxiv.org/html/2606.25632#A3)\) gives the procedure end\-to\-end, with the three LLM prompts driving it reproduced in Appendix [F](https://arxiv.org/html/2606.25632#A6)\.

## 4 Knowledge Boundary Benchmark

We construct KBF\-QA, a multiple\-choice benchmark for Knowledge Boundary Fidelity \(KBF\)\. It tests whether a character agent answers facts visible from the character’s narrative position and refuses facts outside it\. The dataset spans eight novels and contains 4,386 multiple\-choice questions in total \(2,442 KRf questions and 1,944 KR questions\)\. Each question is posed to a character $c$ and offers five options: four candidate answers $\{A, B, C, D\}$ and a refusal option $E$ \(“I cannot answer this from my own knowledge”\)\. The visibility component $V$ \(§[3\.2](https://arxiv.org/html/2606.25632#S3.SS2.SSS0.Px2)\) determines the split: KRf items have the queried fact in $\mathcal{F}_c$ with gold answer in $\{A, B, C, D\}$; KR items have it in $\mathcal{F} \setminus \mathcal{F}_c$ with gold answer $E$\. The two splits target complementary epistemic skills: recalling what the character could witness, and refusing what it could not\. The KR split is what distinguishes this from closed\-book QA: an LLM that “knows” the fact still fails by committing to it while speaking as $c$\. Construction process, example items, and per\-book composition are in Appendix [D](https://arxiv.org/html/2606.25632#A4)\.
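The way visibility fixes both the split and the gold answer can be illustrated with a short Python sketch\. The function name `split_and_gold` and the example visibility set are hypothetical; the actual construction process is described in Appendix D of the paper\.

```python
from typing import FrozenSet, Tuple

REFUSAL = "E"  # "I cannot answer this from my own knowledge"


def split_and_gold(visible_to: FrozenSet[str], character: str,
                   correct_option: str) -> Tuple[str, str]:
    """Return (split, gold) for a question about one fact posed to `character`."""
    if character in visible_to:
        return "KRf", correct_option  # fact is in F_c: the character should answer
    return "KR", REFUSAL              # fact is outside F_c: the character should refuse


visible_to = frozenset({"Holmes", "Watson"})  # hypothetical V(f)
print(split_and_gold(visible_to, "Watson", "B"))    # ('KRf', 'B')
print(split_and_gold(visible_to, "Lestrade", "B"))  # ('KR', 'E')
```

The same underlying fact can therefore yield a recall item for one character and a refusal item for another\.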

#### The KBF metric\.

KBF summarises performance on the two KBF\-QA splits\. Let $\textsc{KRf}_{\text{acc}}$ and $\textsc{KR}_{\text{acc}}$ be the recall and refusal accuracies, with sample counts $n_{\textsc{KRf}}$ and $n_{\textsc{KR}}$\. We define

$$\textsc{KBF} = \frac{n_{\textsc{KRf}} + n_{\textsc{KR}}}{\dfrac{n_{\textsc{KRf}}}{\textsc{KRf}_{\text{acc}}} + \dfrac{n_{\textsc{KR}}}{\textsc{KR}_{\text{acc}}}}, \tag{4}$$

the sample\-weighted harmonic mean\. The harmonic mean \(as in F1\) penalises systems that collapse one split to gain the other; sample\-weighting accounts for per\-book variation in the KRf:KR ratio\. Each method’s free\-form response is deterministically matched to one of $\{A, B, C, D, E\}$\.
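Eq\. \(4\) can be computed directly, as in the following sketch\. The split sizes are the KBF\-QA totals reported above; the accuracies are made\-up inputs for illustration, not results from the paper\. Returning 0 when either accuracy is 0 is an assumption that follows the limiting behaviour of the harmonic mean; the paper does not state how this case is handled\.

```python
def kbf(n_krf: int, krf_acc: float, n_kr: int, kr_acc: float) -> float:
    """Sample-weighted harmonic mean of recall and refusal accuracy (Eq. 4)."""
    if krf_acc == 0.0 or kr_acc == 0.0:
        return 0.0  # assumption: limiting value when one split collapses entirely
    return (n_krf + n_kr) / (n_krf / krf_acc + n_kr / kr_acc)


# Split sizes from KBF-QA; accuracies are hypothetical.
balanced = kbf(2442, 0.80, 1944, 0.60)
collapsed = kbf(2442, 0.90, 1944, 0.05)
print(round(balanced, 3))    # 0.697
print(collapsed < balanced)  # True
```

The second call shows the penalty the text describes: high recall cannot compensate for near\-zero refusal accuracy\.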

## 5 Experiments

We evaluate ReverieMem on two tasks over the same eight novels\. KBF\-QA tests whether agents respect character knowledge boundaries, addressing factual overreach\. Pairwise narrative generation tests whether agents preserve character voice and behaviour, addressing stylistic monotony\.

### 5\.1 Evaluation Metrics

#### Boundary task\.

On KBF\-QA \(§[4](https://arxiv.org/html/2606.25632#S4)\), we report three scores: KRf accuracy on visible items, KR accuracy on invisible items, and their sample\-weighted harmonic mean KBF \(Eq\. [4](https://arxiv.org/html/2606.25632#S4.E4)\)\.

#### Narrative\-generation task\.

We follow the five\-dimension pairwise evaluation protocol of BookWorld\. An LLM judge compares two narratives generated from the same scripts and the same character cast\. Four dimensions are shared across both modes: Anthropomorphism \(An\), Character Fidelity \(CF\), Immersion & Setting \(IS\), and Writing Quality \(WQ\)\. The fifth dimension depends on the generation mode: Storyline Quality \(SQ\) when a script is supplied \(*with script*\) and Creativity \(Cr\) when generation is open\-ended \(*without script*\)\.

### 5\.2 Experimental Setup

#### Boundary task\.

All methods are evaluated on the benchmark in §[4](https://arxiv.org/html/2606.25632#S4) with gpt\-5\-mini under greedy decoding\. ReverieMem is compared against six reference methods: *direct*, character\-profile prompting without retrieval; four narrative\-domain RAG methods, *naive RAG* \(Lewis et al\., [2020](https://arxiv.org/html/2606.25632#bib.bib8)\), *RAPTOR* \(Sarthi et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib15)\), *HippoRAG* \(Gutiérrez et al\., [2024](https://arxiv.org/html/2606.25632#bib.bib7)\), and *ComoRAG* \(Wang et al\., [2026](https://arxiv.org/html/2606.25632#bib.bib22)\); and *BookWorld* \(Ran et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib14)\), a retrieval\-augmented character\-agent system\.

#### Narrative\-generation task\.

For each of the eight novels we adopt the scripts released by *BookWorld* as experimental presets, yielding 226 scripts in total\. ReverieMem is evaluated with five generator LLMs, two closed models and three open models: gpt\-5\-mini, gemini\-3\.1, qwen\-3\.5\-plus, deepseek\-3\.7, and deepseek\-v4pro \(Singh et al\., [2026](https://arxiv.org/html/2606.25632#bib.bib17); Team, [2026](https://arxiv.org/html/2606.25632#bib.bib19); DeepSeek\-AI et al\., [2025](https://arxiv.org/html/2606.25632#bib.bib5)\)\. Reference methods are *direct* \(character\-profile prompting\), *HoLLMwood* \(Chen et al\., [2024a](https://arxiv.org/html/2606.25632#bib.bib2)\), and *BookWorld*\. Pairwise judging uses gpt\-4o\. This protocol follows Ran et al\. \([2025](https://arxiv.org/html/2606.25632#bib.bib14)\), who reported substantial agreement between model\-based pairwise judgments and human evaluations using Cohen’s kappa coefficient \($\kappa$\)\.

Table 1: Results on the boundary benchmark: ReverieMem against six reference methods\.

### 5\.3 Main Results

#### Knowledge\-boundary fidelity\.

Table [1](https://arxiv.org/html/2606.25632#S5.T1) reports the boundary results\. ReverieMem achieves the highest score on every KBF metric, with the gain coming jointly from recall and refusal\. On closer examination, the RAG baselines split into two failure modes: *naive RAG*, *RAPTOR*, and *HippoRAG* retrieve any matching fact regardless of perspective and reach moderate recall but near\-zero refusal; *ComoRAG* inverts this profile, refusing too aggressively at the cost of recall\. *Direct* and BookWorld fall in the middle on both axes but score low overall\.

#### In\-character narrative generation\.

Table [3](https://arxiv.org/html/2606.25632#S5.T3) reports pairwise win rates of ReverieMem against each reference method across five generator LLMs\. ReverieMem consistently wins the majority of comparisons under both modes, with the largest margins against *Direct* and *HoLLMwood* and a clear lead even against BookWorld, the strongest baseline\. The largest gains appear on WQ, SQ/Cr, and IS\.

Table 2: Ablation results of ReverieMem on the boundary benchmark\.

Table 3: Pairwise win rates \(%\) of ReverieMem against Direct, HoLLMwood \(HW\), and BookWorld \(BW\) under two narrative\-generation modes\. Protocol: Appendix [E](https://arxiv.org/html/2606.25632#A5)\.

### 5\.4 Ablation Study

#### Boundary ablation\.

Table [2](https://arxiv.org/html/2606.25632#S5.T2) reports boundary results when one factual layer is removed at a time\. Removing the episodic layer pushes recall up but drops refusal: without scene\-level gating, the agent over\-commits to retrieved facts that the character could not have witnessed\. Removing the semantic layer inverts the trade\-off: with no fact pool to query, the agent collapses into refusal\. Neither variant matches the full system’s balanced KBF, confirming that the two factual layers are jointly necessary\.

#### Narrative ablation\.

Table [4](https://arxiv.org/html/2606.25632#S5.T4) reports the full system’s pairwise win rate against each ablated variant under both modes\. The full system wins or ties on every dimension, with the largest margins on WQ, SQ, and Cr\. Removing the personality layer leaves the agent factually competent but stylistically flat, with the biggest drop on WQ\. Removing either factual layer most weakens IS, the dimension that depends on scene\-anchored knowledge\.

Table 4: Pairwise win rates \(%\) of ReverieMem against ablated variants that remove one memory layer at a time\.

### 5\.5 Discussion

The two tasks share a single architectural explanation\. The boundary task isolates factual overreach in knowledge\-access terms: unbounded RAG baselines retrieve any matching fact and answer aggressively, while direct prompting and BookWorld under\-retrieve; only when retrieval is gated by per\-character visibility do recall and refusal improve together\. The narrative\-generation task shows that the same gating mechanism also produces stronger character\-anchored stories, with the largest wins on the dimensions most affected by cross\-character knowledge leaks\. The ablations sharpen this picture: each layer is independently necessary, and together they address the two failure modes of factual overreach and stylistic monotony\. Avoiding these failures in book\-based role playing is therefore governed less by how much an agent retrieves than by what it is allowed to retrieve, which is what perspective\-bounded memory enforces\.
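The contrast between unbounded and visibility\-gated retrieval can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the `Fact` class, its `visible_to` field, and the keyword matching are hypothetical stand\-ins (the actual system uses visibility\-tagged tuples and dense retrieval), and the two facts paraphrase the *Dracula* examples in Figure 3.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str
    visible_to: frozenset  # assumed field: characters who could know the fact

def unbounded_retrieve(query_terms, facts):
    """Baseline behaviour: return every matching fact, ignoring perspective."""
    return [f for f in facts if any(t in f.text.lower() for t in query_terms)]

def visibility_gated_retrieve(query_terms, facts, character):
    """Restrict the pool to facts visible to `character`, then match."""
    allowed = [f for f in facts if character in f.visible_to]
    return unbounded_retrieve(query_terms, allowed)

facts = [
    Fact("Renfield escaped one night to the deserted house next door",
         frozenset({"Seward"})),
    Fact("Lucy's private imitation, shared only with Mina",
         frozenset({"Lucy", "Mina"})),
]

seen = visibility_gated_retrieve(["renfield"], facts, "Seward")
hidden = visibility_gated_retrieve(["lucy"], facts, "Seward")
leaked = unbounded_retrieve(["lucy"], facts)
```

For Seward, the gated call returns the Renfield fact but nothing about Lucy's private exchange, whereas the unbounded call surfaces it. An empty gated result is the signal on which an agent can refuse instead of answering from another character's knowledge.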

## 6 Conclusion

We propose ReverieMem, a three\-layer memory architecture for character role playing over long narrative texts that addresses OOC failure along its two faces: Factual Overreach and Stylistic Monotony\. Inspired by cognitive psychology, ReverieMem pairs a visibility\-tagged knowledge graph with a perspective\-bounded reasoning pipeline and a personality layer abstracted from in\-source conduct\. Across eight novels, ReverieMem surpasses the strongest prior character\-agent system on both KBF\-QA and pairwise narrative comparisons\. These results indicate that avoiding both failures requires bounding what a character knows and grounding how it conducts itself in patterns extracted from the character’s actual speech and behaviour\.

## Limitations

#### Perspective stress tests\.

ReverieMem and KBF\-QA target the cognitive side of perspective: what a character could plausibly know given their narrative position\. We have not yet constructed targeted stress tests built around literary works in which perspective itself is the central narrative device, such as the multi\-witness contradictions of *Rashomon*\-style narratives, deliberately information\-asymmetric detective plots, or novels carried by an unreliable narrator\. Designing specialised evaluations that probe the architecture under these extreme perspective dynamics is left to future work\.

#### Multi\-character interaction\.

ReverieMem does not provide a dedicated mechanism for orchestrating multi\-character interactions\. In multi\-agent scenes, the system relies on simple judgment over the current scene context to decide how each character responds, rather than a separate framework for inter\-agent coordination, turn arrangement, or mutual reasoning over what other agents have inferred\. Combining perspective\-bounded memory with existing multi\-agent character\-orchestration frameworks is a natural future direction, where each agent reasons within its own visibility while a coordination layer manages turn\-taking and inter\-agent inference\.

## References

- Asai et al\. \(2024\) Akari Asai, Zeqiu Wu, Yizhong Wang, Avi Sil, and Hannaneh Hajishirzi\. 2024\. Self\-rag: Learning to retrieve, generate, and critique through self\-reflection\. In *International Conference on Learning Representations*, volume 2024, pages 9112–9141\.
- Chen et al\. \(2024a\) Jing Chen, Xinyu Zhu, Cheng Yang, Chufan Shi, Yadong Xi, Yuxiang Zhang, Junjie Wang, Jiashu Pu, Tian Feng, Yujiu Yang, and Rongsheng Zhang\. 2024a\. [HoLLMwood: Unleashing the creativity of large language models in screenwriting via role playing](https://doi.org/10.18653/v1/2024.findings-emnlp.474)\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 8075–8121, Miami, Florida, USA\. Association for Computational Linguistics\.
- Chen et al\. \(2024b\) Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi\-Min Chan, Heyang Yu, Yaxi Lu, Yi\-Hsin Hung, Chen Qian, Yujia Qin, Xin Cong, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou\. 2024b\. [Agentverse: Facilitating multi\-agent collaboration and exploring emergent behaviors](https://proceedings.iclr.cc/paper_files/paper/2024/file/578e65cdee35d00c708d4c64bce32971-Paper-Conference.pdf)\. In *International Conference on Learning Representations*, volume 2024, pages 20094–20136\.
- Conway and Pleydell\-Pearce \(2000\) Martin A Conway and Christopher W Pleydell\-Pearce\. 2000\. The construction of autobiographical memories in the self\-memory system\. *Psychological Review*, 107\(2\):261\.
- DeepSeek\-AI et al\. \(2025\) DeepSeek\-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, and 181 others\. 2025\. [Deepseek\-v3 technical report](https://arxiv.org/abs/2412.19437)\. *Preprint*, arXiv:2412\.19437\.
- Feng et al\. \(2025\) Qiming Feng, Qiujie Xie, Xiaolong Wang, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, and Shang Gao\. 2025\. [EmoCharacter: Evaluating the emotional fidelity of role\-playing agents in dialogues](https://doi.org/10.18653/v1/2025.naacl-long.316)\. In *Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 6218–6240, Albuquerque, New Mexico\. Association for Computational Linguistics\.
- Gutiérrez et al\. \(2024\) Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su\. 2024\. [Hipporag: Neurobiologically inspired long\-term memory for large language models](https://doi.org/10.52202/079017-1902)\. In *Advances in Neural Information Processing Systems*, volume 37, pages 59532–59569\. Curran Associates, Inc\.
- Lewis et al\. \(2020\) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen\-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela\. 2020\. [Retrieval\-augmented generation for knowledge\-intensive nlp tasks](https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf)\. In *Advances in Neural Information Processing Systems*, volume 33, pages 9459–9474\. Curran Associates, Inc\.
- Li et al\. \(2025\) Moxin Li, Yong Zhao, Wenxuan Zhang, Shuaiyi Li, Wenya Xie, See\-Kiong Ng, Tat\-Seng Chua, and Yang Deng\. 2025\. [Knowledge boundary of large language models: A survey](https://doi.org/10.18653/v1/2025.acl-long.256)\. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 5131–5157, Vienna, Austria\. Association for Computational Linguistics\.
- Liu et al\. \(2025\) Wenhao Liu, Siyu An, Junru Lu, Muling Wu, Tianlong Li, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Di Yin, Xing Sun, and Xuanjing Huang\. 2025\. [Tell me what you don’t know: Enhancing refusal capabilities of role\-playing agents via representation space analysis and editing](https://doi.org/10.18653/v1/2025.findings-acl.311)\. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 5983–6005, Vienna, Austria\. Association for Computational Linguistics\.
- Mcadams and McLean \(2013\) Dan Mcadams and Kate McLean\. 2013\. [Narrative identity](https://doi.org/10.1177/0963721413475622)\. *Current Directions in Psychological Science*, 22:233–238\.
- McClelland et al\. \(1995\) James L McClelland, Bruce L McNaughton, and Randall C O’Reilly\. 1995\. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory\. *Psychological Review*, 102\(3\):419\.
- Park et al\. \(2023\) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S\. Bernstein\. 2023\. [Generative agents: Interactive simulacra of human behavior](https://doi.org/10.1145/3586183.3606763)\. In *Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology*, UIST ’23, New York, NY, USA\. Association for Computing Machinery\.
- Ran et al\. \(2025\) Yiting Ran, Xintao Wang, Tian Qiu, Jiaqing Liang, Yanghua Xiao, and Deqing Yang\. 2025\. [Bookworld: From novels to interactive agent societies for creative story generation](https://arxiv.org/abs/2504.14538)\. *Preprint*, arXiv:2504\.14538\.
- Sarthi et al\. \(2024\) Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher Manning\. 2024\. [Raptor: Recursive abstractive processing for tree\-organized retrieval](https://proceedings.iclr.cc/paper_files/paper/2024/file/8a2acd174940dbca361a6398a4f9df91-Paper-Conference.pdf)\. In *International Conference on Learning Representations*, volume 2024, pages 32628–32649\.
- Shao et al\. \(2023\) Yunfan Shao, Linyang Li, Junqi Dai, and Xipeng Qiu\. 2023\. [Character\-LLM: A trainable agent for role\-playing](https://aclanthology.org/2023.emnlp-main.814/)\. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 13153–13187, Singapore\. Association for Computational Linguistics\.
- Singh et al\. \(2026\) Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El\-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker\-Whitcomb, Alex Beutel, Alex Karpenko, and 467 others\. 2026\. [Openai gpt\-5 system card](https://arxiv.org/abs/2601.03267)\. *Preprint*, arXiv:2601\.03267\.
- Tang et al\. \(2024\) Yihong Tang, Jiao Ou, Che Liu, Fuzheng Zhang, Di Zhang, and Kun Gai\. 2024\. [Erabal: Enhancing role\-playing agents through boundary\-aware learning](https://arxiv.org/abs/2409.14710)\. *Preprint*, arXiv:2409\.14710\.
- Team \(2026\) Qwen Team\. 2026\. [Qwen3\.5: Accelerating productivity with native multimodal agents](https://qwen.ai/blog?id=qwen3.5)\.
- Tseng et al\. \(2024\) Yu\-Min Tseng, Yu\-Chao Huang, Teng\-Yun Hsiao, Wei\-Lin Chen, Chao\-Wei Huang, Yu Meng, and Yun\-Nung Chen\. 2024\. [Two tales of persona in LLMs: A survey of role\-playing and personalization](https://doi.org/10.18653/v1/2024.findings-emnlp.969)\. In *Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 16612–16631, Miami, Florida, USA\. Association for Computational Linguistics\.
- Tu et al\. \(2024\) Quan Tu, Shilong Fan, Zihang Tian, Tianhao Shen, Shuo Shang, Xin Gao, and Rui Yan\. 2024\. [CharacterEval: A Chinese benchmark for role\-playing conversational agent evaluation](https://doi.org/10.18653/v1/2024.acl-long.638)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 11836–11850, Bangkok, Thailand\. Association for Computational Linguistics\.
- Wang et al\. \(2026\) Juyuan Wang, Rongchen Zhao, Wei Wei, Yufeng Wang, Mo Yu, Jie Zhou, Jin Xu, and Liyan Xu\. 2026\. [Comorag: A cognitive\-inspired memory\-organized rag for stateful long narrative reasoning](https://doi.org/10.1609/aaai.v40i39.40644)\. *Proceedings of the AAAI Conference on Artificial Intelligence*, 40\(39\):33557–33565\.
- Wang et al\. \(2024a\) Noah Wang, Z\.y\. Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu, Hongcheng Guo, Ruitong Gan, Zehao Ni, Jian Yang, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu, Wenhao Huang, Jie Fu, and Junran Peng\. 2024a\. [RoleLLM: Benchmarking, eliciting, and enhancing role\-playing abilities of large language models](https://doi.org/10.18653/v1/2024.findings-acl.878)\. In *Findings of the Association for Computational Linguistics: ACL 2024*, pages 14743–14777, Bangkok, Thailand\. Association for Computational Linguistics\.
- Wang et al\. \(2025a\) Xiaoyang Wang, Hongming Zhang, Tao Ge, Wenhao Yu, Dian Yu, and Dong Yu\. 2025a\. [Opencharacter: Training customizable role\-playing llms with large\-scale synthetic personas](https://arxiv.org/abs/2501.15427)\. *Preprint*, arXiv:2501\.15427\.
- Wang et al\. \(2025b\) Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen\-Tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, and Yanghua Xiao\. 2025b\. [CoSER: Coordinating LLM\-based persona simulation of established roles](https://proceedings.mlr.press/v267/wang25dk.html)\. In *Proceedings of the 42nd International Conference on Machine Learning*, volume 267 of *Proceedings of Machine Learning Research*, pages 64822–64858\. PMLR\.
- Wang et al\. \(2024b\) Xintao Wang, Yunze Xiao, Jen\-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, and Yanghua Xiao\. 2024b\. [InCharacter: Evaluating personality fidelity in role\-playing agents through psychological interviews](https://doi.org/10.18653/v1/2024.acl-long.102)\. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 1840–1873, Bangkok, Thailand\. Association for Computational Linguistics\.
- Xia et al\. \(2025\) Haotian Xia, Hao Peng, Yunjia Qi, Bin Xu, Juanzi Li, Hou Lei, and Xiaozhi Wang\. 2025\. [Storywriter: A multi\-agent framework for long story generation](https://doi.org/10.1145/3746252.3761616)\. In *Proceedings of the 34th ACM International Conference on Information and Knowledge Management*, CIKM ’25, pages 6559–6563, New York, NY, USA\. Association for Computing Machinery\.
- Zhao et al\. \(2026\) Qi Zhao, Jian Li, Yizhang Jin, Dongqi Liu, Hang Ding, Qi Song, and Xiangyang Li\. 2026\. [Rolearena: A multi\-agent role\-playing environment for long multi\-turn dialogues with autonomous plot progression](https://openreview.net/forum?id=o1idr3SbjG)\.
- Zhao et al\. \(2024\) Runcong Zhao, Wenjia Zhang, Jiazheng Li, Lixing Zhu, Yanran Li, Yulan He, and Lin Gui\. 2024\. NarrativePlay: Interactive narrative understanding\. In *Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(EACL 2024\): System Demonstrations*, pages 82–93\. Association for Computational Linguistics\.

## Appendix A Three\-Layer Memory Statistics

We build the three memory layers for eight novels \(six English, two Chinese\), covering 50 focus characters across 360 scenes\. The Semantic Layer \(§[3\.2](https://arxiv.org/html/2606.25632#S3.SS2.SSS0.Px2)\) holds 12,468 SPOCV tuples and the Personality Layer \(§[3\.2](https://arxiv.org/html/2606.25632#S3.SS2.SSS0.Px3)\) holds 1,536 behavioural patterns; per\-book breakdowns are in Table [6](https://arxiv.org/html/2606.25632#A6.T6)\. Per\-book KBF\-QA composition and scores are given in §[D](https://arxiv.org/html/2606.25632#A4)\.

## Appendix B Source\-to\-Memory Examples

Tables [7](https://arxiv.org/html/2606.25632#A6.T7)–[9](https://arxiv.org/html/2606.25632#A6.T9) show one real artefact from each of the three memory layers, all anchored to the same scene: Scene 1 of *A Study in Scarlet*, where Watson and Holmes first meet in the Bart’s chemical laboratory\. Each artefact is taken verbatim from the output of our Source\-to\-Memory pipeline \(§[3\.3](https://arxiv.org/html/2606.25632#S3.SS3)\)\.

## Appendix C Inference Algorithm

Algorithm [1](https://arxiv.org/html/2606.25632#alg1) gives the full procedure of ReverieMem’s inference pipeline \(§[3\.4](https://arxiv.org/html/2606.25632#S3.SS4)\), end\-to\-end\. The pipeline holds two invariants throughout: every retrieval is bounded to the visibility\-allowed Semantic\-Layer subset $\mathcal{F}_c$ \(so facts that character $c$ could not have witnessed are never admitted into the pool\), and reasoning is anchored on $c$’s own scene memory $\mathcal{S}_c$ before any detail reconstruction is attempted\. The scene\-guided loop runs for at most $N$ rounds and exits early once the workspace $(\mathcal{S}_c, \mathcal{M}^{(t)})$ already supports a response\.
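The control flow of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: every entry of `ops` is a hypothetical stub standing in for the operation of the same name, the single `self_probe` signature with an empty history on the first call is a simplification, and the toy stubs use keyword matching where the paper uses LLM prompts and dense retrieval.

```python
from typing import Callable, Dict, List, Set

def reveriemem_infer(q: str, c: str, ops: Dict[str, Callable],
                     visible_facts: List[str], prev_pattern: str,
                     max_rounds: int = 3):
    """Control-flow sketch of Algorithm 1; every entry in `ops` is a stub."""
    scene = ops["anchor"](q, c)                      # scene recall
    probes: List[str] = [ops["self_probe"](q, scene, [], set())]
    pool: Set[str] = set()
    for t in range(1, max_rounds + 1):
        # Retrieval only ever sees the visibility-allowed subset F_c.
        pool |= set(ops["vis_retrieve"](probes[-1], visible_facts))
        if ops["sufficient"](q, scene, pool) or t == max_rounds:
            break                                    # early exit or round cap
        probes.append(ops["self_probe"](q, scene, probes, pool))
    if ops["emo_gate"](q, c):                        # personality track
        pattern = ops["pattern"](c, ops["emo_transition"](c))
    else:
        pattern = prev_pattern                       # emotional inertia
    return ops["memory_fusion"](q, c, scene, pool, pattern)

# Toy stubs (all hypothetical) so the sketch runs end to end.
toy_ops = {
    "anchor": lambda q, c: f"{c} recalls the night of the escape",
    "self_probe": lambda q, scene, probes, pool: "house" if probes else "escape",
    "vis_retrieve": lambda probe, facts: [f for f in facts if probe in f.lower()],
    "sufficient": lambda q, scene, pool: bool(pool),
    "emo_gate": lambda q, c: False,
    "emo_transition": lambda c: "alarmed",
    "pattern": lambda c, e: f"{c} when {e}",
    "memory_fusion": lambda q, c, scene, pool, m: {"facts": sorted(pool), "pattern": m},
}
seward_visible = ["Renfield fled to the deserted house next door"]
result = reveriemem_infer("Where did he go?", "Seward", toy_ops,
                          seward_visible, prev_pattern="clinical")
empty = reveriemem_infer("Where did he go?", "Seward", toy_ops,
                         [], prev_pattern="clinical")
```

In the toy run the first probe retrieves nothing, the second probe finds the visible fact, and the loop exits in round two. With an empty visible subset the loop stops at the round cap with an empty pool, which is the case where a refusal would be appropriate.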

## Appendix D KBF\-QA Construction

We build KBF\-QA jointly with human annotators who have read each novel: an LLM drafts candidate items grounded in the source text, and the annotators verify each item against the source\. Figure [3](https://arxiv.org/html/2606.25632#A6.F3) illustrates one example from each split \(posed to Dr\. John Seward\), and Table [10](https://arxiv.org/html/2606.25632#A6.T10) gives per\-book composition together with per\-book KBF scores for all seven Task 1 methods \(§[5](https://arxiv.org/html/2606.25632#S5)\)\.
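To make the two splits concrete, here is a small sketch of how multiple\-choice items of this kind could be scored per split. It assumes that recall is accuracy on items whose fact is visible to the character and that refusal is accuracy on items whose gold answer is the refusal option \(E\); the exact KBF definition is given in §4 and is not reproduced here. The `Item` class and function name are hypothetical, and the two items mirror the Figure 3 examples only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

REFUSAL = "E"  # option "I cannot answer this from my own knowledge."

@dataclass(frozen=True)
class Item:
    visible: bool  # True if the fact is visible to the asked character
    gold: str      # gold option letter; REFUSAL for non-visible facts

def split_accuracy(items: List[Item],
                   answers: List[str]) -> Tuple[Optional[float], Optional[float]]:
    """Return (recall, refusal): accuracy on visible and non-visible items."""
    if len(items) != len(answers):
        raise ValueError("one answer per item is required")
    hits = {True: [], False: []}
    for item, answer in zip(items, answers):
        hits[item.visible].append(answer == item.gold)

    def mean(xs):
        return sum(xs) / len(xs) if xs else None  # None for an empty split

    return mean(hits[True]), mean(hits[False])

# The two Figure 3 items: gold (C) for the visible fact, (E) otherwise.
items = [Item(visible=True, gold="C"), Item(visible=False, gold=REFUSAL)]
bookworld = split_accuracy(items, ["A", "B"])
reveriemem = split_accuracy(items, ["C", "E"])
```

On these two items alone the BookWorld answers score 0 on both splits and the ReverieMem answers score 1 on both. This only restates the figure and says nothing about the full benchmark numbers in Table 10.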

## Appendix E Pairwise Narrative Evaluation Protocol

We adopt the eight novels and their plot scripts directly from the released setup of Ran et al\. \([2025](https://arxiv.org/html/2606.25632#bib.bib14)\) without introducing additional books or scripts\. Table [5](https://arxiv.org/html/2606.25632#A5.T5) gives the per\-book counts, totalling 226 scripts\.

Table 5: Per\-book number of plot scripts in Task 2, adopted from Ran et al\. \([2025](https://arxiv.org/html/2606.25632#bib.bib14)\)\.

For each generator LLM, we collect head\-to\-head pairwise comparisons of ReverieMem against each of the three reference methods \(*Direct*, *HoLLMwood*, and *BookWorld*\) on all 226 plot scripts\. The win rates reported in Table [3](https://arxiv.org/html/2606.25632#S5.T3) are the per\-comparison averages of these head\-to\-head measurements\.
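The aggregation described above can be sketched as a simple count over judge verdicts. The record layout and function name are assumptions for illustration, the verdicts are made\-up toy data rather than results from Table 3, and because this section does not state how ties are scored, the sketch counts outright wins only.

```python
from collections import defaultdict

def win_rates(verdicts, system="ReverieMem"):
    """Percentage of comparisons won by `system`, per (baseline, dimension)."""
    wins, totals = defaultdict(int), defaultdict(int)
    for v in verdicts:
        key = (v["baseline"], v["dimension"])
        totals[key] += 1
        if v["winner"] == system:  # ties or losses add to the total only
            wins[key] += 1
    return {key: 100.0 * wins[key] / totals[key] for key in totals}

# Hypothetical judge verdicts, one record per script and dimension.
verdicts = [
    {"baseline": "Direct", "dimension": "WQ", "winner": "ReverieMem"},
    {"baseline": "Direct", "dimension": "WQ", "winner": "ReverieMem"},
    {"baseline": "Direct", "dimension": "WQ", "winner": "Direct"},
    {"baseline": "Direct", "dimension": "WQ", "winner": "ReverieMem"},
    {"baseline": "BookWorld", "dimension": "CF", "winner": "BookWorld"},
    {"baseline": "BookWorld", "dimension": "CF", "winner": "ReverieMem"},
]
rates = win_rates(verdicts)
```

Here `rates` maps `("Direct", "WQ")` to 75.0 and `("BookWorld", "CF")` to 50.0. In the paper's setup each cell would be averaged over the 226 scripts for one generator LLM.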

## Appendix F Inference Prompts

Algorithm [1](https://arxiv.org/html/2606.25632#alg1) invokes three LLM prompts, reproduced in Tables [11](https://arxiv.org/html/2606.25632#A6.T11)–[13](https://arxiv.org/html/2606.25632#A6.T13): Self\-Probe \(lines 2 and 8\) generates the next retrieval anchor from the scene memory; Pattern \(line 12\) selects a behavioural pattern from the Personality Layer; and Memory\-Fusion \(line 16\) writes the final response under the visibility\-bounded fact pool\. The exit condition Sufficient \(line 5\) is decided within the same Self\-Probe call from the second round on, rather than as a separate prompt\. Anchor and Vis\-Retrieve are dense\-retrieval operations that do not call an LLM\.

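| Book | Language | \# Characters | \# Scenes | \# SPOCV facts | \# Patterns |
| --- | --- | --- | --- | --- | --- |
| A Study in Scarlet | English | 5 | 25 | 904 | 197 |
| Tom Sawyer | English | 3 | 92 | 2,298 | 170 |
| Treasure Island | English | 3 | 46 | 1,517 | 126 |
| Around the World in 80 Days | English | 10 | 41 | 1,836 | 175 |
| Dracula | English | 9 | 33 | 973 | 166 |
| Paradise Lost | English | 7 | 38 | 1,926 | 167 |
| Solaris | Chinese | 4 | 50 | 1,657 | 134 |
| Ball Lightning | Chinese | 9 | 35 | 1,357 | 401 |
| Total | | 50 | 360 | 12,468 | 1,536 |

Table 6: Per\-book size of the three memory layers\.

Episodic Layer · first\-person scene summary for Dr\. Watson, Scene 1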

| Field | Content |
| --- | --- |
| Scene | 1: Chemical laboratory at St\. Bartholomew’s Hospital, London \(1878–79\) |
| Atmosphere | Intellectually intense, slightly eccentric; contrasts Watson’s weariness with Holmes’s energetic obsession |
| Scene understanding | “I had recently returned to England after being wounded during the Second Anglo\-Afghan War, and was seeking affordable lodgings\. By chance I met Stamford at the Criterion Bar; over lunch he mentioned a man named Sherlock Holmes who was also looking for someone to share rooms… At Bart’s, Holmes was ecstatic over a reagent that detects haemoglobin, and demonstrated it by pricking his own finger\. Stamford introduced us, and Holmes startled me by correctly deducing I had been in Afghanistan\.” |
| My actions | “I asked Stamford what he’d been doing; pressed him on Holmes’s character when he seemed evasive; agreed to visit the lab; expressed polite interest in the demonstration; was taken aback by Holmes’s abrupt observation about Afghanistan\.” |
| My emotions | “Initial surprise and warmth at a familiar face; cautious optimism about the lodgings; mild suspicion at Stamford’s hesitation; strong surprise at Holmes’s Afghanistan deduction; restrained admiration mixed with unease\.” |

Table 7: Episodic Layer artefact: Watson’s first\-person summary of Scene 1 of *A Study in Scarlet*\.

Semantic Layer · six SPOCV tuples from Scene 1

Table 8: Semantic Layer artefacts: six SPOCV tuples from Scene 1, spanning the four visibility routes\. W = Watson; H = Holmes; S = Stamford; org = organisational visibility; common = common knowledge\.

Personality Layer · two of fifty personality patterns for Sherlock Holmes

Table 9: Personality Layer artefacts: two of fifty patterns for Sherlock Holmes, drawn from opposite ends of *A Study in Scarlet*\.

1. $\mathcal{S}_c \leftarrow \textsc{Anchor}(q,\ c)$ $\triangleright$ scene recall
2. $\mathcal{P}^{(1)} \leftarrow \textsc{Self-Probe}(q,\ \mathcal{S}_c)$; $\mathcal{M}^{(0)} \leftarrow \emptyset$
3. for $t = 1$ to $N$ do $\triangleright$ scene\-guided detail reconstruction
4. $\mathcal{M}^{(t)} \leftarrow \mathcal{M}^{(t-1)} \cup \textsc{Vis-Retrieve}(\mathcal{P}^{(t)},\ \mathcal{F}_c)$
5. $\sigma^{(t)} \leftarrow \textsc{Sufficient}(q,\ \mathcal{S}_c,\ \mathcal{M}^{(t)})$
6. if $\sigma^{(t)} = \text{true}$ or $t = N$ then break
7. end if
8. $\mathcal{P}^{(t+1)} \leftarrow \textsc{Self-Probe}(q,\ \mathcal{S}_c,\ \mathcal{P}^{(1:t)},\ \mathcal{M}^{(t)})$
9. end for
10. $\tau \leftarrow t$ $\triangleright$ exit round
11. if $\textsc{Emo-Gate}(q,\ c) = \textsc{Fire}$ then $\triangleright$ parallel personality track
12. $e^{*} \leftarrow \textsc{Emo-Transition}(c)$; $m^{*} \leftarrow \textsc{Pattern}(c,\ e^{*})$
13. else
14. $m^{*} \leftarrow$ previous pattern of $c$ $\triangleright$ emotional inertia
15. end if
16. return $r \leftarrow \textsc{Memory-Fusion}(q,\ c,\ \mathcal{S}_c,\ \mathcal{M}^{(\tau)};\ m^{*})$ $\triangleright$ perspective\-bounded memory fusion

Algorithm 1: ReverieMem inference for query $q$ on character $c$\.

Source: *Dracula* · Asked of: Dr\. John Seward

KRf · fact visible to Seward \($f \in \mathcal{F}_{\text{Seward}}$\)

Inquirer: “Dr\. Seward, you saw your patient slip away one night – where did he go?”

Options: \(A\) the ruined churchyard; \(B\) the quay at Whitby; \(C\) a deserted house; \(D\) an empty lodging in Munich; \(E\) “I cannot answer this from my own knowledge\.”

Gold: \(C\) · BookWorld: \(A\) × · ReverieMem: \(C\) ✓

Why: Seward, as the asylum physician, personally witnesses Renfield escape one night and follows him to the deserted house next door \(Carfax\)\. BookWorld defaults to a vampire\-trope answer \(“ruined churchyard”\) without anchoring on what Seward actually saw\.

KR · fact not visible to Seward \($f \in \mathcal{F} \setminus \mathcal{F}_{\text{Seward}}$\)

Inquirer: “Doctor Seward, whom did your dear friend mimic in private before her illness worsened?”

Options: \(A\) a Saxon child; \(B\) Mr\. Swales; \(C\) a Magyar woman; \(D\) Mina Harker; \(E\) “I cannot answer this from my own knowledge\.”

Gold: \(E\) · BookWorld: \(B\) × · ReverieMem: \(E\) ✓

Why: Lucy’s private imitation of Mina took place between the two friends alone, with no medical witness present\. Seward attends Lucy as her physician but is not part of these intimate exchanges; the fact is visible only to Lucy and Mina\.

Figure 3: Two example KBF\-QA items from *Dracula*, posed to Dr\. John Seward\.

Table 10: Per\-book composition and KBF scores for the Task 1 boundary benchmark \(KBF\-QA\)\. Bold indicates the highest value in each row\.

Table 11: Self\-Probe prompt \(Algorithm [1](https://arxiv.org/html/2606.25632#alg1), lines 2 and 8\)\.

Table 12: Pattern prompt \(Algorithm [1](https://arxiv.org/html/2606.25632#alg1), line 12\)\.

Table 13: Memory\-Fusion prompt \(Algorithm [1](https://arxiv.org/html/2606.25632#alg1), line 16\)\.