ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

Hugging Face Daily Papers 06/04/26, 12:00 AM Papers

role-playing language-agents character-arc narrative-evaluation benchmark fine-tuning

Summary

This paper introduces ArcANE, an automatically constructed benchmark for evaluating role-playing language agents' alignment with character psychological trajectories across narrative phases, showing that conditioning on character arc information improves performance, especially in scenarios beyond the source text.

Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.

Original Article

Cached at: 06/05/26, 06:07 AM

Source: https://huggingface.co/papers/2606.05553

Abstract

Role-playing language agents should portray characters who develop over the course of a narrative, which calls for benchmarks that evaluate alignment with a character's psychological trajectory rather than static factual recall. On ArcANE, models conditioned on Character Arc information outperform every other context strategy.

Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.
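The abstract describes the benchmark's shape only in prose, so a small sketch may help make it concrete: an arc split into phases along one psychological axis, and a probe that poses the same scenario once per phase. Everything below is an assumption made for illustration. The class names (Phase, CharacterArc, Probe), the fields, the expand_probe helper, and the example character are hypothetical and are not taken from the paper or its released data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Phase:
    """One segment of an arc. All fields are illustrative assumptions."""
    index: int          # position of the phase along the arc
    first_chapter: int  # inclusive
    last_chapter: int   # inclusive
    state: str          # the character's stance on the axis in this phase


@dataclass(frozen=True)
class CharacterArc:
    character: str
    axis: str  # the psychological axis the phases are ordered along
    phases: Tuple[Phase, ...]

    def phase_at(self, chapter: int) -> Phase:
        """Return the phase whose chapter range contains `chapter`."""
        for phase in self.phases:
            if phase.first_chapter <= chapter <= phase.last_chapter:
                return phase
        raise ValueError(f"chapter {chapter} is outside the arc")


@dataclass(frozen=True)
class Probe:
    scenario: str
    in_source: bool  # False for situations the source text never explores


def expand_probe(arc: CharacterArc, probe: Probe) -> List[Dict[str, object]]:
    """Pose the same scenario once per phase, attaching that phase's arc context."""
    return [
        {
            "character": arc.character,
            "phase": phase.index,
            "scenario": probe.scenario,
            "in_source": probe.in_source,
            "arc_context": f"{arc.axis}: {phase.state}",
        }
        for phase in arc.phases
    ]


# Invented example data, not taken from the benchmark.
arc = CharacterArc(
    character="Example Character",
    axis="trust in others",
    phases=(
        Phase(0, 1, 10, "guarded and suspicious"),
        Phase(1, 11, 25, "cautiously open"),
        Phase(2, 26, 40, "fully trusting"),
    ),
)
probe = Probe(scenario="A stranger offers to carry your bag.", in_source=False)

for item in expand_probe(arc, probe):
    print(item["phase"], item["arc_context"])
```

Under these assumptions, an in-character answer to the same scenario should differ from phase to phase. For a probe with in_source set to False there is no matching passage to retrieve, so the arc context is the only phase-specific signal available, which is consistent with the abstract's remark that the gap is largest where retrieval has nothing to find.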



Similar Articles

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

When Roleplaying, Do Models Believe What They Say?

NARRA-Gym for Evaluating Interactive Narrative Agents
