One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

arXiv cs.AI 05/25/26, 04:00 AM Papers
reinforcement-learning npc game-ai persona-conditioning zero-shot-generalization multi-agent life-simulation
Summary
Introduces PCSP, a single RL policy conditioned on frozen LLM embeddings of persona descriptions, enabling scalable, real-time persona-traceable NPC control in life simulation games. Experiments show zero-shot persona identification and behavioral alignment, with faster inference than LLM baselines.
arXiv:2605.23652v1 Announce Type: new Abstract: On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Hand-authored behavior trees, per-character RL policies, unsupervised skill discovery, and per-step LLM controllers each fail on one or more deployment constraints: persona consistency, natural-language controllability, zero-shot generalization, and real-time inference. We introduce pcsp (Persona-Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona identification to chance. External validation on three Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a 1.7% failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results provide empirical evidence that shared RL policies can support scalable, real-time, persona-conditioned NPC control.
Cached at: 05/25/26, 08:58 AM
# Persona-Traceable Shared RL Policies for Scalable Game Agents
Source: [https://arxiv.org/html/2605.23652](https://arxiv.org/html/2605.23652)
## One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

###### Abstract

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification (unseen-occupation held-out) up to $17\times$ above chance, Spearman $\rho \approx 0.73$ semantic-behavioral alignment, and $22\times$ faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Hand-authored behavior trees, per-character RL policies, unsupervised skill discovery, and per-step LLM controllers each fail on one or more deployment constraints: persona consistency, natural-language controllability, zero-shot generalization, and real-time inference. We introduce pcsp (Persona-Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three pcsp-d (formerly Mini-Inzoi) experimental settings, including a richer 20-action v3 ontology, ablations show that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona identification to chance even when task reward is preserved or improved.
External validation on three Melting Pot 2.4.0 \[[15](https://arxiv.org/html/2605.23652#bib.bib16)\] substrates spanning commons-pool, public-good, and dyadic-matrix social dilemmas (`commons_harvest__open`, `clean_up`, `prisoners_dilemma_in_the_matrix__repeated`) confirms that the §[III](https://arxiv.org/html/2605.23652#S3) method produces persona-conditioned behavioral divergence in multi-agent strategic substrates and that the consistency-loss ablation collapses trajectory $\to$ persona retrieval to chance in every substrate while leaving (or inflating) pairwise action-KL. We distinguish two senses of held-out evaluation: *compositional zero-shot* (unseen occupation $\times$ archetype crosses within the trained persona-space coverage; the regime of Layer 1 and Layer 3), and *vocabulary-expansion held-out* (new persona tokens whose embeddings lie inside the convex hull of training embeddings but were never present at training; the regime where Layer 2 still fails, top-1 $= 0$, and which we report as an open problem). A UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a 1.7% failure rate and held-out zero-shot generalization at 0.04% failure, showing that the sub-frame inference profile and ablation structure survive the move into a commercial game engine. These results provide empirical evidence that shared RL policies can support scalable, real-time, persona-conditioned NPC control and that trajectory traceability is central to the method.

## I Introduction

Modern life simulation games (*The Sims*, *Animal Crossing*, *inZOI*, and emerging open-world titles) place NPCs at the center of the player experience. For these games to feel alive, each NPC must behave consistently with a distinct personality: the gregarious chef and the reclusive artist should pursue the same biological needs through recognizably different activity patterns, social approaches, and daily rhythms. At scale (hundreds to thousands of NPCs per game world), this creates a fundamental challenge that current game AI architectures were not designed to address.

### I-A The NPC Personalization Scaling Gap

#### Behavior trees

remain the industrial standard \[[35](https://arxiv.org/html/2605.23652#bib.bib22),[7](https://arxiv.org/html/2605.23652#bib.bib44)\]. A skilled designer hand-crafts decision logic for each character archetype. This produces believable, predictable NPCs, but authoring cost scales linearly with character count, and trees are brittle outside authored scenarios. Ten thousand distinct NPC personalities require ten thousand hand-crafted trees.

#### Per\-NPC RL

appears to offer automation: train a separate policy for each persona\. In principle this allows unlimited characters with arbitrary behaviors\. In practice, memory and training cost also scale linearly, and each new persona requires a full retraining cycle—a prohibitive expense for live\-service games that regularly introduce new characters\.

#### LLM\-as\-policy

methods \[[19](https://arxiv.org/html/2605.23652#bib.bib8),[33](https://arxiv.org/html/2605.23652#bib.bib9),[36](https://arxiv.org/html/2605.23652#bib.bib10),[26](https://arxiv.org/html/2605.23652#bib.bib41),[34](https://arxiv.org/html/2605.23652#bib.bib40)\] achieve rich, natural-language-grounded persona expression by querying a language model at every decision step. This works well in offline or turn-based contexts, but even our local Qwen3-1.7B LLM-as-policy baseline requires 43.7 ms per decision step (Table [XI](https://arxiv.org/html/2605.23652#A1.T11)), which exceeds the 16–33 ms per-frame budget of real-time game simulations. Generative Agents \[[19](https://arxiv.org/html/2605.23652#bib.bib8)\] sidestep this by operating on minute-resolution plans rather than per-frame actions, which suits narrative simulation but cannot drive moment-to-moment NPC movement and activity selection.

#### Unsupervised skill discovery

\(DIAYN\[[9](https://arxiv.org/html/2605.23652#bib.bib6)\], CIC\[[14](https://arxiv.org/html/2605.23652#bib.bib7)\]\), often paired with intrinsic\-motivation objectives\[[20](https://arxiv.org/html/2605.23652#bib.bib35)\], learns a shared policy conditioned on latent skill codes, achieving behavioral diversity without per\-character training\. But the codes carry no semantic content\. A game designer cannot specify “this NPC should be agreeable and fitness\-oriented” by selecting a latent index\. Natural\-language controllability is not part of the design\.

### I-B Persona-Conditioned Shared Policies

We address this scaling problem with a different decomposition: *compute a rich persona representation once per NPC, then condition a single lightweight policy on that representation at every step*.

The key enabler is the modern LLM embedding space. A frozen language model maps free-form persona descriptions to dense vectors that capture semantic relationships among personalities. A shared policy conditioned on these embeddings can, in principle, generalize to *any* persona a designer writes, without retraining, because the continuous embedding space provides coverage of the entire personality manifold. At inference time, the LLM is called once per NPC lifetime; the policy network, at most a few hundred kilobytes, runs at full game speed.

This approach, which we call *persona-conditioned shared policies*, remains underexplored in game AI. This paper presents a concrete implementation, pcsp, and evaluates when persona-conditioned behavior remains recoverable from generated trajectories. The central empirical question is not whether a policy can receive persona text as input, but whether its trajectories preserve enough persona signal to support zero-shot identification, semantic alignment, and real-time deployment.

### I-C Paper Contributions

1. Method. pcsp is a shared policy conditioned on frozen LLM persona embeddings via a low-rank persona projection (LoRA-style matrix factorization used as a standalone projection layer, not as a parallel adapter on a pretrained weight) and FiLM/concat fusion, co-trained with PPO, an InfoNCE trajectory-consistency objective, and KL diversity regularization (§[III](https://arxiv.org/html/2605.23652#S3)).
2. Three-layer validation methodology. We argue that persona-conditioned agents require *separated* validation of mechanism, generalization, and deployment, and instantiate this with a controlled diagnostic substrate (pcsp-d), an external multi-agent RL substrate (Melting Pot), and a realtime UE5 engine deployment (§[IV](https://arxiv.org/html/2605.23652#S4)).
3. Mechanistic finding (Layer 1). Under controlled conditions, the InfoNCE consistency term is causally responsible for trajectory-level persona recoverability: removing it preserves task reward but collapses zero-shot persona identification to chance across three independent environment instantiations of pcsp-d (§[V](https://arxiv.org/html/2605.23652#S5)).
4. External generalization (Layer 2). The same method, without algorithmic modification beyond a CNN observation front-end, transfers to three Melting Pot social-dilemma substrates spanning commons-pool, public-good, and dyadic-matrix structures (`commons_harvest__open`, `clean_up`, `prisoners_dilemma_…_repeated`); the consistency-loss ablation collapses trajectory $\to$ persona retrieval to chance in all three while leaving (or inflating) pairwise action-KL (§[VI](https://arxiv.org/html/2605.23652#S6)). Retrieval over the full 12-persona vocabulary remains $3.4$–$4.9\times$ chance top-1, and a CU $\to$ CH cross-substrate transfer of the persona projection and trajectory encoder retrieves at $1.79\times$ chance top-1 (§[VI-A](https://arxiv.org/html/2605.23652#S6.SS1), Tab. [VII](https://arxiv.org/html/2605.23652#S6.T7)).
5. Deployment finding (Layer 3). A frozen Layer-1 checkpoint deployed in UE5 sustains 64 concurrent persona-conditioned agents at realtime with 1.7% failure, generalizes to 60 held-out personas at 0.04% failure, and *reproduces the InfoNCE ablation in-engine* (matched-persona symmetric KL of 1.79 nats between full and `no_consist` checkpoints, reward 1,423.5 vs. 1,079.8 on the same map and persona set), establishing that the consistency objective is load-bearing under engine-side contention (§[VII](https://arxiv.org/html/2605.23652#S7)).
6. Reproducibility. Open ONNX checkpoints, three-layer benchmark code, UE5 plugin, and trajectory-annotation harness.

## II Why Current Paradigms Fall Short

Table [I](https://arxiv.org/html/2605.23652#S2.T1) evaluates six paradigms against the four axes that matter for practical life-simulation NPC deployment.

TABLE I: Paradigm comparison. ✓ = satisfied, $\times$ = not satisfied, $\triangle$ = partially satisfied. The pcsp row reports satisfaction *in our evaluated settings* (the three-layer stack of §[IV](https://arxiv.org/html/2605.23652#S4)); open issues such as Melting Pot vocabulary-expansion held-out recovery (§[VI-A](https://arxiv.org/html/2605.23652#S6.SS1)), broader human believability, and open-world generalization beyond Layer 3 are discussed in §[IX](https://arxiv.org/html/2605.23652#S9).

#### Goal-conditioned RL.

UVFA \[[24](https://arxiv.org/html/2605.23652#bib.bib2)\] and HER \[[1](https://arxiv.org/html/2605.23652#bib.bib3)\] condition policies on goal *states*, that is, on what to achieve. Persona describes *how to behave*: two NPCs with identical needs but different personalities should fulfill them through different activity sequences. Goal conditioning addresses the wrong axis of variation and provides no natural-language interface.

#### Language\-conditioned RL\.

BabyAI \[[5](https://arxiv.org/html/2605.23652#bib.bib4)\] conditions on natural-language *instructions* that change each episode. Persona is a stable trait persisting over the NPC's lifetime; treating it as an episode-level instruction discards the long-horizon consistency constraint. Decision Transformer \[[4](https://arxiv.org/html/2605.23652#bib.bib5)\] and similar sequence-modeling approaches condition behavior on past returns or goals, but not on static identity descriptors that must generalize zero-shot.

#### Unsupervised skill discovery\.

DIAYN \[[9](https://arxiv.org/html/2605.23652#bib.bib6)\] and CIC \[[14](https://arxiv.org/html/2605.23652#bib.bib7)\] achieve behavioral diversity via learned latent codes, but the codes carry no semantic content. Generalizing to a new designer-specified persona requires solving an inverse problem: find the latent code for “an introverted accountant who prioritizes learning.” This is not supported by design.

#### LLM\-as\-policy\.

Generative Agents \[[19](https://arxiv.org/html/2605.23652#bib.bib8)\], Voyager \[[33](https://arxiv.org/html/2605.23652#bib.bib9)\], and ReAct \[[36](https://arxiv.org/html/2605.23652#bib.bib10)\] produce rich, persona-consistent behavior but remain expensive when used as per-step controllers. In our benchmark, the Qwen3-1.7B LLM-as-policy baseline requires 43.7 ms per decision step (Table [XI](https://arxiv.org/html/2605.23652#A1.T11)), already above a 16–33 ms frame budget. Techniques such as action caching, plan reuse, and smaller distilled models can reduce latency, but the fundamental tension between rich language reasoning and real-time step budgets remains. Generalist agents such as Gato \[[22](https://arxiv.org/html/2605.23652#bib.bib11)\] demonstrate that a single model can handle diverse tasks, but their per-step inference cost is similarly high. Large-scale self-play agents such as AlphaStar \[[32](https://arxiv.org/html/2605.23652#bib.bib32)\], OpenAI Five \[[3](https://arxiv.org/html/2605.23652#bib.bib33)\], and AlphaZero \[[27](https://arxiv.org/html/2605.23652#bib.bib34)\] establish that RL can master complex games, but they specialize a single policy to a single game rather than expressing a vocabulary of personas within one world.

#### Summary\.

No existing paradigm satisfies all four axes\. The design space between “fast but no NL control” and “NL control but too slow” has not been systematically explored\. Persona\-conditioned shared policies are a concrete proposal for how to close this gap\.

## III pcsp: Persona-Conditioned Shared Policy

We present pcsp as one specific point in the design space of persona-conditioned shared policies. The components below are motivated by ablation evidence but are not canonical; they define the experimental system evaluated across the three validation layers in §[IV](https://arxiv.org/html/2605.23652#S4). The method itself is *environment-agnostic*: a frozen embedding plus a low-rank persona projection, a shared policy with a conditioning fusion, and a PPO+InfoNCE+KL co-training objective. Environment, observation, action, and reward specifications are deferred to §[IV](https://arxiv.org/html/2605.23652#S4) together with the layer they instantiate.

### III-A Persona Encoding

Given persona text $p$, we compute a frozen LLM embedding $\hat{e}_p \in \mathbb{R}^{1024}$ using the Transformer-based \[[31](https://arxiv.org/html/2605.23652#bib.bib38)\] Qwen3-Embedding-0.6B \[[39](https://arxiv.org/html/2605.23652#bib.bib21)\] (last-token pooling, L2-normalized); contrastively-trained sentence encoders such as Sentence-BERT \[[23](https://arxiv.org/html/2605.23652#bib.bib39)\] provide our B3 baseline below. The persona embedding is computed once per NPC lifetime (14.6 ms/persona), imposing negligible runtime cost.

A learned low-rank persona projection ($r = 16$, $\text{lr} = 10^{-4}$) maps $\hat{e}_p \to e_p \in \mathbb{R}^{64}$ through a low-dimensional projection layer whose matrix-factorization structure is inspired by LoRA \[[13](https://arxiv.org/html/2605.23652#bib.bib18)\]. We do not add a parallel $BA$ adapter to a pretrained weight matrix in the Qwen3 backbone; rather, we factor the standalone $1024 \to 64$ persona projection as a rank-16 product $BA$ to keep the learnable persona pathway small while leaving the embedding model frozen:

$$e_p = \operatorname{norm}\left(\alpha\, B A\, \hat{e}_p\right), \qquad A \in \mathbb{R}^{16 \times 1024}, \quad B \in \mathbb{R}^{64 \times 16}, \quad \alpha = 64^{-1/2}.$$

The projection is necessary: raw Qwen3 embeddings over-represent occupational similarity relative to personality similarity (salesperson $\leftrightarrow$ executive: 0.61 vs. trainer $\leftrightarrow$ blogger: 0.33). After low-rank projection training, Spearman $\rho$ between projected persona distance and behavioral KL increases from 0.384 to 0.728 (§[V](https://arxiv.org/html/2605.23652#S5)).
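As a rough illustration of the factored projection, the sketch below applies a rank-16 map $BA$ to a 1024-dimensional vector and L2-normalizes the result. The random initialization, the helper names (`matvec`, `l2_normalize`, `project_persona`), and the synthetic input are assumptions made for illustration only; the trained weights and the real Qwen3 embedding are not reproduced here. Because the output is normalized, any positive scale $\alpha$ gives the same $e_p$, so the value passed below is a placeholder.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def l2_normalize(v, eps=1e-12):
    """Scale v to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]


def project_persona(e_hat, A, B, alpha):
    """e_p = norm(alpha * B (A e_hat)); A is r x d_in, B is d_out x r."""
    z = matvec(A, e_hat)                      # 1024 -> 16 bottleneck
    y = matvec(B, z)                          # 16 -> 64
    return l2_normalize([alpha * t for t in y])


rng = random.Random(0)                        # hypothetical init, illustration only
d_in, r, d_out = 1024, 16, 64
A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
B = [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d_out)]
e_hat = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(d_in)])  # stand-in for a frozen embedding

e_p = project_persona(e_hat, A, B, alpha=d_out ** -0.5)

# Parameter count of the factored map versus a dense 1024 x 64 matrix.
n_params = r * d_in + d_out * r               # 16*1024 + 64*16 = 17408
n_dense = d_in * d_out                        # 65536
```

Under these shapes the factored pathway holds roughly a quarter of the parameters of a dense projection, which is the sense in which it keeps the learnable persona pathway small.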

### III-B Persona-Conditioned Shared Policy

The shared policy $\pi_\theta(a \mid s, e_p)$ is a 3-hidden-layer MLP (256-256-128) with an output head sized to the environment action space. Our reference implementation uses FiLM \[[21](https://arxiv.org/html/2605.23652#bib.bib17)\] to inject the persona signal at every hidden layer:

$$h_\ell' = \gamma_\ell(e_p) \odot h_\ell + \beta_\ell(e_p),$$

with small linear networks $\gamma_\ell, \beta_\ell : \mathbb{R}^{64} \to \mathbb{R}^{d_\ell}$, where $d_\ell$ is the hidden width of layer $\ell$. An analogous FiLM-conditioned value network $V_\psi(s, e_p)$ serves as the critic. Total trainable parameters: 207K (excluding the frozen LLM). We treat the choice of conditioning mechanism as an implementation detail rather than a contribution: the v3 zero-shot ablations in §[V](https://arxiv.org/html/2605.23652#S5) show that input-concatenation conditioning is competitive with FiLM and in fact generalizes more strongly to unseen occupations. The load-bearing component is the consistency objective introduced next, which is what makes persona signal recoverable from trajectories regardless of how the policy is conditioned.
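The FiLM modulation can be sketched in a few lines. The version below is a minimal illustration, assuming the modulation is applied after a ReLU activation and that $\gamma_\ell$ starts at 1 and $\beta_\ell$ at 0; neither choice is stated in the text. The 33-dimensional observation is the v3 size, and all weights are random placeholders, not the trained policy.

```python
import random


def linear(W, b, x):
    """Affine map W x + b with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def film_layer(h_in, e_p, layer):
    """h = relu(W h_in + b), then h' = gamma(e_p) * h + beta(e_p) elementwise."""
    h = [max(0.0, v) for v in linear(layer["W"], layer["b"], h_in)]
    gamma = linear(layer["Wg"], layer["bg"], e_p)
    beta = linear(layer["Wb"], layer["bb"], e_p)
    return [g * v + b for g, v, b in zip(gamma, h, beta)]


def make_layer(d_in, d_out, d_persona, rng):
    """Hypothetical init: random trunk weights, FiLM starting at the identity."""
    return {
        "W": [[rng.gauss(0.0, 0.05) for _ in range(d_in)] for _ in range(d_out)],
        "b": [0.0] * d_out,
        "Wg": [[0.0] * d_persona for _ in range(d_out)],
        "bg": [1.0] * d_out,   # gamma = 1 at init
        "Wb": [[0.0] * d_persona for _ in range(d_out)],
        "bb": [0.0] * d_out,   # beta = 0 at init
    }


def trunk(obs, e_p, layers):
    """Run the observation through every FiLM-conditioned hidden layer."""
    h = obs
    for layer in layers:
        h = film_layer(h, e_p, layer)
    return h


rng = random.Random(0)
widths = [33, 256, 256, 128]                  # v3 observation size, then 256-256-128
layers = [make_layer(widths[i], widths[i + 1], 64, rng) for i in range(3)]
obs = [rng.random() for _ in range(33)]
e_p = [rng.gauss(0.0, 1.0) for _ in range(64)]
features = trunk(obs, e_p, layers)            # 128-dim input to the action head
```

An input-concatenation variant would instead append `e_p` to `obs` before the first layer and drop the `gamma`/`beta` heads.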

### III-C Co-Training Objective

$$\mathcal{L} = \mathcal{L}_{\text{PPO}} + \lambda_1 \mathcal{L}_{\text{consist}} + \lambda_2 \mathcal{L}_{\text{diverse}}, \tag{1}$$

with $\lambda_1 = 0.5$, $\lambda_2 = 0.1$ (PPO: $\gamma = 0.99$, $\epsilon = 0.2$, GAE-$\lambda = 0.95$, batch $= 2048$ \[[25](https://arxiv.org/html/2605.23652#bib.bib1)\]; cooperative multi-agent PPO variants \[[37](https://arxiv.org/html/2605.23652#bib.bib28),[8](https://arxiv.org/html/2605.23652#bib.bib29)\] share the same per-agent update used here).

Consistency loss. A 2-layer GRU trajectory encoder $g_\eta(\tau) \in \mathbb{R}^{64}$ maps episode trajectories to L2-normalized embeddings. InfoNCE \[[30](https://arxiv.org/html/2605.23652#bib.bib19)\] ($T = 0.07$) aligns trajectory embeddings with their originating persona embeddings:

$$\mathcal{L}_{\text{consist}} = -\log \frac{\exp\bigl(\text{sim}(g_\eta(\tau), e_p)/T\bigr)}{\sum_{p' \in \mathcal{B}} \exp\bigl(\text{sim}(g_\eta(\tau), e_{p'})/T\bigr)}, \tag{2}$$

where the sum runs over the personas $p'$ in the batch $\mathcal{B}$. This penalizes the policy for producing trajectories that cannot be traced back to their conditioning persona.
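A minimal sketch of Eq. (2) for a single trajectory follows, assuming that $\text{sim}$ is cosine similarity and that all embeddings are already L2-normalized so a dot product suffices. The function name `info_nce` and the toy basis vectors are illustrative; the GRU encoder that would produce `traj_emb` is not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(traj_emb, persona_embs, target, temperature=0.07):
    """-log softmax(sim / T)[target], computed with a stable log-sum-exp.

    traj_emb and every row of persona_embs are assumed to be unit vectors,
    so the dot product equals cosine similarity.
    """
    logits = [dot(traj_emb, e) / temperature for e in persona_embs]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


# Toy batch of three mutually orthogonal persona embeddings in R^4.
basis = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
loss_traceable = info_nce(basis[0], basis, target=0)   # trajectory matches its persona
loss_mismatch = info_nce(basis[0], basis, target=1)    # trajectory looks like another persona
loss_uninformative = info_nce([0.0, 0.0, 0.0, 1.0], basis, target=0)  # equals log(3)
```

With $T = 0.07$ the loss is near zero for a well-aligned trajectory and equals $\log |\mathcal{B}|$ when the trajectory embedding carries no persona information, which is the chance-level regime the ablation describes.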

Diversity loss. We maximize the expected KL divergence between policy distributions induced by different personas at shared states:

$$\mathcal{L}_{\text{diverse}} = -\mathbb{E}_{s,\, p \neq p'}\Bigl[D_{\mathrm{KL}}\bigl(\pi_\theta(\cdot \mid s, e_p) \,\|\, \pi_\theta(\cdot \mid s, e_{p'})\bigr)\Bigr], \tag{3}$$

computed as a batched forward pass over $n = 8$ sampled personas at 32 states. This discourages behavioral collapse toward a persona-agnostic average.
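The diversity term and the weighted sum of Eq. (1) can be sketched as below. This assumes discrete action distributions given as probability lists and averages over all ordered persona pairs at each state; the pairing scheme and the small `eps` for numerical safety are assumptions, and the toy distributions are invented.

```python
import math
from itertools import permutations


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def diversity_loss(probs):
    """probs[s][k] is the action distribution at state s under persona k.

    Returns minus the mean KL over states and ordered persona pairs k != k'.
    """
    total, count = 0.0, 0
    for per_state in probs:
        for p, q in permutations(per_state, 2):
            total += kl(p, q)
            count += 1
    return -total / count


def total_loss(l_ppo, l_consist, l_diverse, lam1=0.5, lam2=0.1):
    """Eq. (1): L = L_PPO + lam1 * L_consist + lam2 * L_diverse."""
    return l_ppo + lam1 * l_consist + lam2 * l_diverse


# One state, two personas, two actions (toy numbers).
collapsed = [[[0.5, 0.5], [0.5, 0.5]]]   # persona-agnostic behavior
distinct = [[[0.9, 0.1], [0.1, 0.9]]]    # personas act differently

loss_collapsed = diversity_loss(collapsed)   # 0: nothing to reward
loss_distinct = diversity_loss(distinct)     # negative: lowers the total loss
```

Because the term enters the objective with a negative sign, minimizing the total loss pushes per-persona action distributions apart.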

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/fig1_pcsp_pipeline.png)
Figure 1: pcsp pipeline and training objectives. Persona text is encoded once per NPC with a frozen Qwen3 embedding model, adapted through a low-rank projection, and consumed by a shared persona-conditioned policy during pcsp-d rollout. The trajectory encoder provides the InfoNCE consistency signal, while the policy is optimized with PPO and KL diversity regularization. The rightmost pcsp-d label depicts the base 12-action/8-need instantiation; the v3 experiments expand the action ontology to 20 actions.

## IV Evaluation Strategy: A Three-Layer Validation Stack

A persona-conditioned policy must answer three different questions: (i) is persona signal causally responsible for trajectory variation; (ii) does the same method survive a different environment, observation geometry, and reward structure; and (iii) does the policy hold up under the asynchrony, contention, and runtime constraints of a commercial game engine. A single environment cannot answer all three: the conditions that make causal isolation possible (full observability, small action space, short episodes) are mutually exclusive with the conditions that test deployment realism. We therefore validate pcsp on three deliberately different layers and report each layer against the question it can actually answer (Table [II](https://arxiv.org/html/2605.23652#S4.T2)).

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/Threelayer-validation-stack.png)
Figure 2: Three-layer validation stack. Each layer is selected for the question it can isolate; the InfoNCE consistency term is ablated in all three. Together they cover mechanism, generalization, and deployment.

TABLE II: Three-layer validation matrix. Per-layer evaluation surface, persona partition, and whether the InfoNCE consistency term is ablated inside that layer.

### IV-A Layer 1: pcsp-d, a controlled substrate for persona-traceability analysis

Validating that a policy's behavior is causally traceable to its conditioning embedding requires an environment where (i) every state transition is fully observable, (ii) the action space is small enough to compute exact trajectory distributions and KL divergences, (iii) the reward composition is controllable so that the contribution of any persona-aligned shaping term can be isolated or removed, and (iv) episode length is short enough to run thousands of held-out personas. No commercial or photorealistic environment satisfies all four simultaneously. We therefore construct pcsp-d (pcsp-Diagnostic; formerly Mini-Inzoi), an intentionally minimal PettingZoo \[[28](https://arxiv.org/html/2605.23652#bib.bib37)\] AEC substrate ($6 \times 6$ grid, 4 agents, 20 discrete intents over 8 bio-social needs). Its role in this paper is *not* to demonstrate behavioral realism (that is the role of Layers 2 and 3) but to expose the InfoNCE consistency term to controlled ablation under conditions where every causal pathway from persona to trajectory is analytically observable. We treat pcsp-d as a microscope, not a world.

Environment. pcsp-d has 8 bio-social needs (hunger, sleep, social, leisure, hygiene, fitness, work, learning) and 12 discrete actions in the v1/v2 base ontology (8 activity + 4 movement); the primary v3 instantiation expands this to 20 discrete intents (16 activity + 4 movement) for finer behavioral granularity. Observation $s \in \mathbb{R}^{20}$ (v1/v2) or $\mathbb{R}^{33}$ (v3): position, time-of-day, needs, summaries of the other agents, and (v3) location affordance, social context, and routine regularity. Reward combines needs satisfaction, persona-aligned activity bonuses ($+0.5$ for preferred actions), and social bonuses scaled by Big Five compatibility \[[17](https://arxiv.org/html/2605.23652#bib.bib25),[11](https://arxiv.org/html/2605.23652#bib.bib43)\] ($0.2 + 0.3 \times \text{cos-sim}(\text{BF}_i, \text{BF}_j)$). The v1/v2 base ontology uses only needs-driven and persona-agnostic social shaping. The v3 instantiation additionally adds a per-action persona-style term (§[V](https://arxiv.org/html/2605.23652#S5)) that is *declared as part of the pcsp-d-v3 environment*, not as a method component. The InfoNCE ablation (§[V](https://arxiv.org/html/2605.23652#S5)) is run against this term without modification, so any persona-recoverability collapse cannot be attributed to removing the shaping. The mechanistic claim of the paper, that the InfoNCE consistency loss is what makes trajectories traceable to their conditioning persona, is therefore demonstrated under *both* the persona-agnostic-reward regime (v1/v2) and the persona-style-reward regime (v3): the `no_consist` ablation collapses zero-shot identification to chance in every case.
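The two bonus terms quoted above are simple enough to write down directly. The sketch below is illustrative only: the Big Five vectors, their ordering (O, C, E, A, N), their $[0, 1]$ encoding, and the action name `cook` are all invented, and the needs-satisfaction term is omitted because its form is not given in the text.

```python
import math


def cos_sim(u, v):
    """Cosine similarity between two non-zero vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def social_bonus(bf_i, bf_j):
    """0.2 + 0.3 * cos-sim(BF_i, BF_j), as quoted in the text."""
    return 0.2 + 0.3 * cos_sim(bf_i, bf_j)


def activity_bonus(action, preferred_actions):
    """+0.5 when the chosen action is one the persona prefers, else 0."""
    return 0.5 if action in preferred_actions else 0.0


# Hypothetical Big Five vectors (O, C, E, A, N); values invented for illustration.
chef = [0.6, 0.5, 0.9, 0.8, 0.3]
artist = [0.9, 0.4, 0.2, 0.5, 0.7]

bonus_same = social_bonus(chef, chef)      # cos-sim = 1, so the bonus is 0.5
bonus_mixed = social_bonus(chef, artist)   # somewhere between 0.2 and 0.5
bonus_cook = activity_bonus("cook", {"cook"})
```

With non-negative trait vectors the social bonus stays within $[0.2, 0.5]$; a signed encoding would widen that range to $[-0.1, 0.5]$.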

Persona dataset. 300 persona texts are generated from 15 Big Five archetypes $\times$ 20 occupations (240 train / 60 zero-shot test). Each text is a 2–3 sentence natural-language description that an NPC designer might write for a new character.
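The 240/60 partition follows from holding out whole occupations, which is how the unseen-occupation split is described in §V-A. A small sketch, with placeholder archetype and occupation names standing in for the real vocabulary:

```python
from itertools import product

# Placeholder labels; the real archetype and occupation names are not reproduced here.
archetypes = [f"archetype_{i:02d}" for i in range(15)]
occupations = [f"occupation_{j:02d}" for j in range(20)]
held_out_occupations = set(occupations[-4:])   # any four occupations

personas = list(product(archetypes, occupations))
train = [(a, o) for a, o in personas if o not in held_out_occupations]
test = [(a, o) for a, o in personas if o in held_out_occupations]

# Every archetype still appears in training; only the occupations are unseen.
train_archetypes = {a for a, _ in train}
chance_top1 = 1.0 / len(test)                  # 1/60, about 1.7%
```

This is what makes the evaluation compositional: the test personas recombine familiar personality structure with occupations the policy has never been conditioned on.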

### IV-B Layer 2: Cross-substrate generalization on Melting Pot

Layer 1 cannot test whether the method survives a different observation geometry, action ontology, or social-dilemma structure. Layer 2 applies pcsp, unchanged apart from a CNN observation front-end, to Melting Pot 2.4.0 \[[15](https://arxiv.org/html/2605.23652#bib.bib16)\]: an external multi-agent RL substrate with $88 \times 88 \times 3$ RGB observations and an 8-action ontology. The same protocol is applied without algorithmic modification across three substrates spanning distinct social-dilemma structures, `commons_harvest__open` (commons-pool), `clean_up` (public-good), and `prisoners_dilemma_in_the_matrix__repeated` (dyadic matrix-game), reported in §[VI](https://arxiv.org/html/2605.23652#S6) (Tab. [VI](https://arxiv.org/html/2605.23652#S6.T6)).

### IV-C Layer 3: Realtime deployment in Unreal Engine 5

A persona-conditioned policy is only meaningful if it survives the engineering pressure of a real game engine: asynchronous tick rates, NavMesh contention, BT failure recovery, ONNX runtime constraints, and shared world state. Layer 3 deploys a *frozen* Layer-1 checkpoint into UE5.5 via a hybrid intent stack and asks three questions that Layers 1 and 2 cannot answer: (i) does the policy meet a realtime wall-clock budget at deployment scale; (ii) does the InfoNCE finding survive engine-side contention; and (iii) do personas maintain identity over horizons far longer than the training episode.

### IV-D Per-layer evaluation protocols and metrics

We evaluate persona\-conditioned NPC control with metrics that separate task performance from persona fidelity\.

#### Persona identification accuracy\.

Given a trajectory, an automated $k$-NN classifier identifies which held-out persona generated it. We report zero-shot accuracy on held-out personas, with random chance determined by the number of test personas.
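The text does not specify how the $k$-NN classifier is built, so the following is only one plausible reading: labeled reference trajectory embeddings, cosine similarity, and a majority vote among the $k$ nearest references. All names and the toy 2-D embeddings are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def knn_predict(query, refs, k=1):
    """refs is a list of (embedding, persona_id); vote among the k most similar."""
    nearest = sorted(refs, key=lambda r: cosine(query, r[0]), reverse=True)[:k]
    votes = Counter(pid for _, pid in nearest)
    return votes.most_common(1)[0][0]


def identification_accuracy(queries, refs, k=1):
    """Fraction of (embedding, true_persona_id) queries that are identified correctly."""
    hits = sum(knn_predict(q, refs, k) == pid for q, pid in queries)
    return hits / len(queries)


def times_above_chance(accuracy, n_personas):
    """Accuracy relative to uniform guessing over n_personas candidates."""
    return accuracy * n_personas


# Toy example with two personas in a 2-D embedding space.
refs = [([1.0, 0.0], "a"), ([0.0, 1.0], "b")]
queries = [([0.9, 0.1], "a"), ([0.2, 0.8], "b"), ([0.6, 0.4], "b")]
acc = identification_accuracy(queries, refs, k=1)   # two of three correct
lift = times_above_chance(acc, n_personas=2)
```

Reporting accuracy as a multiple of chance makes settings with different numbers of test personas (for example 60 versus 100) comparable.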

#### Behavioral diversity\.

Mean pairwise KL divergence between action distributions conditioned on different persona embeddings measures whether policies collapse to persona\-agnostic behavior despite high task reward\.

#### Semantic\-behavioral alignment\.

Spearman $\rho$ between projected persona distances and pairwise behavioral KL divergences measures whether semantic distances in persona space are reflected in behavior space. The projected embeddings are L2-normalized, so Euclidean distance is monotonic with cosine distance. High $\rho$ with low KL variance can indicate alignment without meaningful diversity, so both quantities are reported where available.
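The monotonicity claim follows from a one-line identity for unit vectors, written here with $d_{\cos}(u, v) = 1 - \cos(u, v)$ as the assumed definition of cosine distance:

```latex
\|u - v\|^2
  = \|u\|^2 + \|v\|^2 - 2\,u^\top v
  = 2 - 2\cos(u, v)
  = 2\, d_{\cos}(u, v),
\qquad \text{for } \|u\| = \|v\| = 1 .
```

Since $\|u - v\| = \sqrt{2\, d_{\cos}(u, v)}$ is strictly increasing in $d_{\cos}$, both distances rank persona pairs identically, and a rank statistic such as Spearman $\rho$ takes the same value under either choice.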

#### Inference latency\.

We report GPU ms/step at batch size 1\. Real\-time game integration typically requires per\-step control below a 16–33 ms frame budget\.
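A quick arithmetic check of the frame budget, using the 43.7 ms baseline latency and the $22\times$ speedup reported elsewhere in the paper. The implied pcsp latency is a derived estimate from those two numbers, not a reported measurement, and the sketch considers a single decision step rather than a full frame's workload.

```python
def frame_budget_ms(fps):
    """Time available per frame at a given frame rate."""
    return 1000.0 / fps


llm_ms = 43.7                 # Qwen3-1.7B LLM-as-policy baseline, from the text
speedup = 22.0                # reported speedup of pcsp over that baseline
implied_pcsp_ms = llm_ms / speedup   # derived estimate, roughly 2 ms

budgets = {fps: frame_budget_ms(fps) for fps in (30, 60)}   # about 33.3 ms and 16.7 ms
llm_fits = {fps: llm_ms <= b for fps, b in budgets.items()}
pcsp_fits = {fps: implied_pcsp_ms <= b for fps, b in budgets.items()}
```

The 16–33 ms range in the text corresponds to 60 and 30 frames per second respectively; the baseline misses both budgets, while the implied per-step cost fits within either.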

#### Rich trajectory observability\.

Identification and believability scores depend on what a trace exposes. Coarse action sequences can compress out temporal, spatial, social, and stylistic dimensions that personality traits modulate. We therefore render rollouts with time-of-day, location, nearby agents, action style, and short-form event semantics for human-facing evaluation artifacts, and treat coarse-versus-rich trace comparison as part of the evaluation protocol.

## V Layer 1 Results: Mechanistic Validation

We evaluate pcsp on four pcsp-d instantiations. v3 is the primary experiment because it exposes a richer action ontology for persona-conditioned behavior; v1, v2, and v3-large (all deferred to App. [A](https://arxiv.org/html/2605.23652#A1)) replicate the central InfoNCE finding at the original 12-action base scale, at a $12 \times 12$/16-agent 12-action expansion, and at a $12 \times 12$/16-agent *20-action* expansion at 500 personas, establishing that the load-bearing component is invariant to grid size, agent count, persona-set size, and action ontology.

### V-A Experimental Setup

v3 (richer action ontology). pcsp-d-v3 keeps the $6 \times 6$/4-agent base scale but expands the action space to 20 flat-discrete actions (16 activity + 4 movement) for finer behavioral granularity (e.g., `focused_work` vs. `planning_work`, `rest_alone` vs. `rest_with_others`, `eat_quick` vs. `eat_slow`). Observations expand to 33 dimensions, adding location-affordance one-hot (8), social-context features (3), and a routine-regularity signal (2). Reward shaping adds a per-action persona-style term $r_{\text{style}} = 0.3 \cdot \cos(\mathbf{p}_{\text{BF}}, \mathbf{s}_a)$ that aligns each action with a hand-authored Big-Five style profile. Same 240/60 unseen-occupation split as v1; 300 PPO iterations.
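A minimal sketch of the style term, assuming a centered Big Five encoding in $[-1, 1]$ ordered as (O, C, E, A, N). The action names come from the text, but the style profiles and the persona vector below are invented for illustration; the hand-authored profiles used in the paper are not reproduced here.

```python
import math

STYLE_SCALE = 0.3   # coefficient of the style term in the text


def cos_sim(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def style_reward(persona_bf, action_style):
    """r_style = 0.3 * cos(p_BF, s_a)."""
    return STYLE_SCALE * cos_sim(persona_bf, action_style)


# Hypothetical style profiles for two of the v3 actions.
style_profiles = {
    "rest_alone":       [0.1, 0.0, -0.8, 0.0, 0.1],
    "rest_with_others": [0.1, 0.0,  0.8, 0.3, 0.0],
}
extravert = [0.2, 0.1, 0.9, 0.4, -0.2]   # hypothetical persona vector

rewards = {name: style_reward(extravert, s) for name, s in style_profiles.items()}
preferred = max(rewards, key=rewards.get)
```

Under this encoding the term is bounded by $\pm 0.3$ per step, so it nudges the choice between otherwise equivalent actions (here, how to rest) without dominating needs satisfaction.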

v1 (base scale). pcsp-d-v1: $6\times6$, 4 agents, 300 personas (240 train / 60 zero-shot test). The held-out 60 personas comprise *four entire occupations* (HR manager, professor, data scientist, chef) crossed with all 15 Big Five archetypes, so this evaluates unseen-occupation compositional generalization: the target occupation never appears in training, only the personality structure does. All models train for 300 PPO iterations ($\approx$1.9M steps) on an NVIDIA RTX 6000 Ada (49 GB VRAM). Baselines: B1 No-Persona PPO; B3 SBERT-conditioned policy (all-MiniLM-L6-v2, 384-dim, frozen); B4 DIAYN (random 64-dim); B5 LLM-as-policy (Qwen3-1.7B).

v2 (expanded scale). pcsp-d-v2: $12\times12$, 16 agents, 500 personas (400 train / 100 zero-shot test), 200 PPO iterations. Same hyperparameters and evaluation protocol as v1. The observation dimension expands to 56 (position 2, time 1, needs 8, 15 other agents $\times$ 3). Random zero-shot chance is 1.0% (1/100 test personas).

v3-large (richer ontology at the expanded scale). pcsp-d-v3-large combines the v3 20-action ontology, affordance/social/routine observation, and style reward with the v2 footprint: $12\times12$, 16 agents, 500 personas (400 train / 100 zero-shot test), obs. dim. 69, 300 PPO iterations, 3 seeds per arm. The role of this instantiation is to test whether the v3 InfoNCE finding holds when grid, agent count, and persona-set size all expand simultaneously under the richer action ontology. Random zero-shot chance is 1.0%.

### V-B Results

Tables [III](https://arxiv.org/html/2605.23652#S5.T3), [XI](https://arxiv.org/html/2605.23652#A1.T11), and [XII](https://arxiv.org/html/2605.23652#A1.T12) show key metrics for each setting. Figure [10](https://arxiv.org/html/2605.23652#A1.F10) shows training reward curves at v1/v2.

TABLE III: v3 results ($6\times6$, 4 agents, 300 personas, 20-action ontology; 240 train / 60 zero-shot held-out occupations). ZS Acc: zero-shot $k$-NN accuracy on 60 unseen-occupation personas (random $\approx 1.7\%$); brackets are Wilson 95% CIs. Coh.: trajectory coherence ratio (intra/inter persona cosine similarity). \* Near-random (failure mode).

TABLE IV: Conditioning architecture $\times$ OOD split (v3). Zero-shot $k$-NN accuracy on three compositional held-out splits, evaluated against the same trained PCSP checkpoint family (random $\approx 1.7\%$). Wilson 95% CIs in brackets; bold marks the higher point estimate per row. † Wilson intervals overlap; gap not significant. Architecture preference flips with the OOD axis: concat generalizes more strongly when the occupation is novel, FiLM when the archetype is novel.

### V-C Key Observations

Across all three experimental settings, the ablation evidence converges on a single load\-bearing component—the InfoNCE consistency loss—while the choice of conditioning architecture proves to be secondary\.

The consistency loss is the load-bearing component. Removing the InfoNCE consistency term collapses zero-shot persona identification to chance in every setting we have tested: v1 drops from 19.3% to 1.7% (random), v2 drops from 2.3% to exactly 0% against a 1.0% chance rate, and v3 drops from 17.0% to 1.7% (Wilson CI $[.01, .04]$). The reward axis hides this failure entirely. At v1 the no\_consist ablation even slightly *increases* reward (84.3 vs. 83.4); at v3 the gap is sharper still (118.4 vs. 104.1, +14 reward) while accuracy collapses by 15.3 percentage points. Without the consistency objective, the policy chases reward through persona-agnostic strategies whose trajectories are no longer traceable back to the conditioning persona, regardless of how the persona signal is injected into the network.
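The consistency objective itself is defined in §III rather than in this section. As a rough illustration of why persona-agnostic trajectories cannot score well on it, here is a minimal InfoNCE sketch. It assumes each trajectory embedding is contrasted against a batch of persona embeddings with the matching persona as the positive; that pairing scheme and the names `l2_normalize` and `info_nce` are assumptions for illustration, and only the temperature 0.07 is taken from the Layer-2 hyperparameter list.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def info_nce(traj_embs, persona_embs, temperature=0.07):
    """Mean cross-entropy of matching trajectory i to persona i in the batch.

    Logits are cosine similarities divided by the temperature; row i treats
    persona i as the positive and every other persona as a negative.
    """
    t = [l2_normalize(v) for v in traj_embs]
    p = [l2_normalize(v) for v in persona_embs]
    total = 0.0
    for i, ti in enumerate(t):
        logits = [sum(a * b for a, b in zip(ti, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(t)


personas = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], personas)    # traceable trajectories
agnostic = info_nce([[1.0, 1.0], [1.0, 1.0]], personas)   # identical for all personas
```

When every trajectory embedding is the same regardless of persona, the loss sits at $\log N$ for a batch of $N$ personas, which is the contrastive analogue of chance-level identification.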

The conditioning architecture is secondary, and conditional on the type of OOD shift. We initially conjectured that FiLM’s layer-wise gating would dominate input concatenation, but the three v3 compositional splits in Table [IV](https://arxiv.org/html/2605.23652#S5.T4) show no universal winner. Concat dominates when the held-out axis is occupation; FiLM dominates when the held-out axis is Big-Five archetype; on the joint occ. $\times$ archetype split the two are statistically indistinguishable. Thus the conditioning mechanism is a generalization-gap knob whose best setting depends on *which* dimension of the persona is novel at test time. The consistency loss, in contrast, is the only component whose removal causes failure across all split families.

Diversity loss prevents behavioral collapse at v1/v2 but its effect weakens on the v3 zero-shot accuracy axis. Removing the KL diversity term reduces mean policy KL by an order of magnitude at v1 (5.87 to 0.39, $15\times$) and v2 (5.40 to 0.48, $11\times$), matching the original behavioral-collapse story. At v3 zero-shot, however, diversity-loss removal lands within the Wilson CI of full (16.0% vs. 17.0%): the policy still achieves persona-recoverable behavior on the accuracy axis without the explicit KL term. We report diversity as a useful regularizer for KL-style behavioral spread rather than as a co-equal contribution to zero-shot identification.

Spearman $\rho$ is preserved across scales. pcsp (full) achieves $\rho = 0.728$ in v1 and $\rho = 0.725$ in v2, a difference of 0.003 despite a $4\times$ increase in grid area, $4\times$ more agents, and 67% more personas. This suggests that semantic-behavioral alignment is a stable property of the design, not an artefact of the minimal environment.

### V-D Coarse-Trace Human Pilot

To test whether the policy’s persona signal is visible to human readers, we ran a 30\-participant Google Forms pilot using the coarse Korean 2AFC survey\. Each item showed a PCSP\-v3 trajectory as a sequence of coarse action labels and asked participants to choose which of two persona descriptions generated it\. Because the form preserved only item\-level A/B selection ratios, we report aggregate forced\-choice accuracy rather than participant\-level variance, confidence, or response\-time analyses\.

Participants selected the correct persona in 612 of 900 aggregate judgments (68.0%; pooled Wilson 95% CI $[64.9, 71.0]$), above the 50% chance baseline. Item-level results were mixed: 15 of 30 items were strongly readable ($\geq 80\%$ correct), 2 were moderately readable (70–79%), 8 were ambiguous (40–69%), and 5 were misleading ($<40\%$). Thus coarse traces do expose some persona-conditioned behavior, but they leave a substantial minority of items ambiguous or inverted. We interpret this as evidence for the observability bottleneck motivating richer trajectory rendering, not as a completed rich-versus-coarse human comparison.
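The pooled interval can be reproduced from the two counts alone. The sketch below applies the standard Wilson score formula with $z = 1.96$; the function name `wilson_interval` is ours, and pooling all 900 judgments as independent trials follows the aggregate reporting described above rather than any participant-level model.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo, hi = wilson_interval(612, 900)
print(f"{100 * lo:.1f} to {100 * hi:.1f}")  # 64.9 to 71.0
```

The same function applies to the bracketed intervals in the zero-shot tables, given the per-arm counts.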

### V-E Qualitative Case Study: Designer-Authored Personas

To test whether pcsp can be controlled by persona descriptions outside the synthetic Big-Five/occupation templates, we constructed 50 designer-authored personas drawn from five sources at varying distance from the training distribution: The Sims 3 trait combinations [[29](https://arxiv.org/html/2605.23652#bib.bib23)] (e.g. Workaholic + Ambitious + Perfectionist; $n = 10$), Animal Crossing villager personality categories [[18](https://arxiv.org/html/2605.23652#bib.bib24)] (e.g. Lazy, Jock, Normal; $n = 12$), Stardew Valley NPC archetypes (e.g. the reclusive coder, the off-grid hermit; $n = 10$), Persona-series confidant descriptions (e.g. the perfectionist student leader, the hikikomori hacker; $n = 6$), and original designer briefs authored from scratch from Big-Five composites (e.g. burned-out doctor, sleep-deprived startup founder; $n = 12$). Each source concept was rewritten as a 2–3 sentence natural-language NPC description and assigned a plausible occupation. All 50 designer-authored persona texts are written in *English*, while the 240 training personas are Korean; the case study therefore additionally serves as a fully cross-lingual robustness probe of the frozen multilingual Qwen3 embedding and the learned low-rank projection. No retraining was performed: each text was encoded with the same Qwen3 embedding pipeline, passed through the learned low-rank projection, and rolled out for five v3 episodes.

Outcome and failure taxonomy. Each rollout is classified by overlap between its top-3 actions and the authored preferred-action signature. Outcomes are *success* (overlap $\geq 2$), *partial* ($= 1$), or *failure* ($= 0$). Failures are further tagged by mechanism: F1 ontology gap (authored behavior is not expressible in the 20-action space, e.g. “protect,” “nurture”), F2 style-reward conflict (top-3 mean style-cosine with the persona Big-Five vector is negative), F3 embedding occupational bias (rollout follows the nearest train-occupation neighbor rather than the authored personality), F4 trait collision (a declared multi-trait composite collapses to a single trait), and F5 residual (zero overlap with none of the above triggers).
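The outcome rule is simple enough to state as code. The sketch below implements the overlap thresholds from the taxonomy and adds a hypergeometric helper for the chance of a given overlap under uniformly random top-3 actions. The function names are hypothetical, and the size of the authored signature is not stated in this section, so it is left as a parameter rather than assumed.

```python
from math import comb


def classify_rollout(top3_actions, preferred_actions):
    """Success if >= 2 of the top-3 actions are in the authored signature,
    partial if exactly 1, failure if none."""
    overlap = len(set(top3_actions) & set(preferred_actions))
    if overlap >= 2:
        return "success"
    if overlap == 1:
        return "partial"
    return "failure"


def chance_overlap_at_least(k_min, n_actions, top_k, signature_size):
    """P(overlap >= k_min) when top_k actions are drawn uniformly without
    replacement from n_actions and signature_size of them count as matches."""
    total = comb(n_actions, top_k)
    hits = sum(
        comb(signature_size, k) * comb(n_actions - signature_size, top_k - k)
        for k in range(k_min, min(top_k, signature_size) + 1)
    )
    return hits / total


outcome = classify_rollout(
    ["focused_work", "eat_quick", "rest_alone"],
    {"focused_work", "planning_work", "eat_quick"},
)
# Illustrative signature sizes only; the paper's value is not given here.
chance_sig3 = chance_overlap_at_least(2, 20, 3, 3)  # 52 / 1140
chance_sig4 = chance_overlap_at_least(2, 20, 3, 4)  # 100 / 1140
```

The chance level depends noticeably on the signature size, so any comparison against a chance baseline should state which size was used.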

TABLE V: Designer-authored persona case study (50 English personas, 5 sources, Korean-trained policy). Outcome breakdown from five v3 rollouts per persona; failure-mode column lists the diagnoses present in that source’s failure cases. Mean cosine similarity to the nearest (Korean) training persona is 0.590 (range 0.478–0.686).

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/fig5_designer_personas_tsne.png) Figure 3: Designer-authored personas in embedding space: raw vs. learned projection. t-SNE of 240 Korean training persona embeddings (grey) plus 50 English designer-authored personas, color-coded by source. (a) In the raw 1024-dim Qwen3 embedding space, the English designer personas form a language-shifted cluster well outside the Korean training distribution: their mean nearest-neighbor distance to any training persona is $12.9\times$ the mean train–train nearest-neighbor distance. (b) After the learned 64-dim low-rank persona projection used by the policy, the same designer personas are fully intermixed with training personas across all five sources (ratio drops to $1.47\times$). The projection is what makes the case-study setting effectively in-distribution for the trained policy, and is the load-bearing piece of the cross-lingual robustness reported in §[V-E](https://arxiv.org/html/2605.23652#S5.SS5).

Three patterns are worth highlighting. First, 43 of 50 personas (86%) produce at least one top-3 action matching the authored intent, and 22 (44%) match at least two, substantially above the chance baseline of matching 2 of 3 from a 20-action set ($\approx 9\%$). This holds even though every designer-authored text is in English while the trained policy has only ever seen Korean persona descriptions, providing direct evidence that the multilingual Qwen3 backbone plus the learned low-rank projection yields a language-agnostic persona representation. Figure [3](https://arxiv.org/html/2605.23652#S5.F3) makes the role of the projection explicit: in the raw Qwen3 space the English designer personas form a clearly separated cluster ($12.9\times$ the train–train nearest-neighbor distance), but after the learned low-rank persona projection that the policy actually consumes they are fully intermixed with the Korean training personas ($1.47\times$). The mean cosine similarity between each English persona and its nearest Korean training neighbor drops only modestly (0.608 with the Korean-text variant of the same designer set vs. 0.590 here), and per-source success rates are stable or higher across the language shift.

Second, source-stratified results expose a meaningful gradient: Stardew Valley NPC archetypes generalize best (50% strong / 40% partial / 10% failure), The Sims 3 trait combinations and original Big-Five briefs sit at a similar level (40% / 50% / 10% and 42% / 42% / 17%), and the Persona-series confidants never fail outright (67% / 33% / 0%). Animal Crossing personality categories remain the hardest source (25% failure), concentrated in F1, F2, and F5 tags. This is consistent with the AC categories being defined almost entirely by stylistic and affective markers (e.g. Cranky, Snooty, Sisterly) that the v3 ontology cannot distinguish at the action level.

Third, the dominant failure mechanism is ontology-limited rather than embedding-limited. Of seven failures, five (F1 $\times$ 3, F4, F5) involve composite or stylistic concepts the 20-action ontology cannot distinguish: the *normal villager* (hygiene/care behavior, F1), *family-oriented caretaker* (protective family role, F1), *animal clinic worker* (gentle care semantics, F1), *sleep-deprived startup founder* (multi-trait collision, F4), and the *jock villager* (fitness-coded social action collapses into the generic socialize\_initiate bucket, F5). Only two failures are squarely policy-side: the *cranky villager* and the *restless rideshare driver* (both F2, the per-action style reward outranks the persona-conditioned intent). Notably, the F3 embedding-occupational-bias failure that appeared in the Korean-text variant disappears here: with English designer text and Korean training neighbors, the embedding can no longer latch onto a shared occupation token. This mirrors the coarse-action observability problem discussed in §[IX](https://arxiv.org/html/2605.23652#S9) and is the cleanest argument in this work that the next environment iteration should expand the social/stylistic axis of the action space rather than the embedding or training pipeline.

### V-F Cross-Scale Replication

The absolute task reward and zero-shot accuracy are lower in v2, as expected for a substantially harder environment with more test personas. The same empirical structure is observed across environment scales: the InfoNCE consistency objective is required for zero-shot persona traceability, and the semantic-behavioral alignment property ($\rho \approx 0.73$) is preserved. Figure [11](https://arxiv.org/html/2605.23652#A1.F11) shows the zero-shot distribution for v1; the v2 pattern is qualitatively similar with narrower per-persona bars. The aggregate evaluation gives nearly identical semantic-behavioral alignment across environments ($\rho = 0.728$ vs. $0.725$). Figure [12](https://arxiv.org/html/2605.23652#A1.F12) visualizes an independent empirical sample of persona pairs from the same checkpoints, again showing strong monotone alignment at both scales.

## VI Layer 2 Results: Cross-Substrate Generalization (Melting Pot)

pcsp-d isolates persona-conditioned control in a deliberately small grid world; Layer 2 tests whether the method survives a different observation geometry, action ontology, and social-dilemma structure. Cooperative multi-agent benchmarks such as Hanabi [[2](https://arxiv.org/html/2605.23652#bib.bib31)] and centralized-critic algorithms [[10](https://arxiv.org/html/2605.23652#bib.bib30),[16](https://arxiv.org/html/2605.23652#bib.bib13)] target related challenges of joint policy learning, but do not address per-agent persona conditioning. We apply pcsp unchanged (no algorithmic modifications beyond the CNN front-end required by RGB inputs) to three Melting Pot 2.4.0 [[15](https://arxiv.org/html/2605.23652#bib.bib16)] substrates spanning three distinct social-dilemma categories: commons\_harvest\_\_open (commons-pool resource), clean\_up (public-good provisioning), and prisoners\_dilemma\_in\_the\_matrix\_\_repeated (dyadic matrix-game with distinct action ontology and $40\times40$ RGB). The substrates were chosen *before* training, and the method was not tuned against them. We evaluate persona identifiability with the same protocol as Layer 1 and re-run the InfoNCE ablation in every substrate.

Substrate. commons\_harvest\_\_open is a 7-player common-pool resource substrate with $88\times88\times3$ RGB observations and an 8-action ontology (NOOP, FORWARD, BACKWARD, STEP\_LEFT/RIGHT, TURN\_LEFT/RIGHT, FIRE\_ZAP). Two MP-specific additions are required and otherwise the algorithm matches §[III](https://arxiv.org/html/2605.23652#S3) byte-for-byte: (i) a 3-layer Nature-style CNN ($32\times8/4 \to 64\times4/2 \to 64\times3/1$) maps the RGB observation to a 256-d feature that feeds the unchanged FiLM trunk, and (ii) a small persona-agnostic shaping term $\tilde{r}_i = r_i + 0.1\cdot\bar{r}$ (team-mean potential) is added to escape the well-known apple-depletion trap on which vanilla PPO collapses to zero return. All other components (frozen Qwen3-0.6B-Embedding, rank-16 low-rank persona projection $1024 \to 64$, FiLM trunk 256-256-128, 2-layer GRU trajectory encoder, full-batch InfoNCE at $T = 0.07$ and $\lambda_{\text{IC}} = 0.5$, KL diversity at $\lambda_2 = 0.1$ over 8 personas $\times$ 32 states, and PPO with $\gamma = 0.99$, GAE-$\lambda = 0.95$, clip 0.2, entropy 0.01) are identical to §[III](https://arxiv.org/html/2605.23652#S3). Personas are the same 12-persona Qwen3 corpus as the companion technical report [[12](https://arxiv.org/html/2605.23652#bib.bib27)], split 10 train / 2 held-out; this section reports in-distribution metrics on the 10 train personas at 1M environment steps per seed. Under the shaping, agents earn modest non-zero return throughout training (mean per-step reward $\approx 0.010$ across all 5 seeds, second half of training), so the substrate is engaged rather than collapsed; commons\_harvest is well known to be hard to fully solve with vanilla PPO and we do not attempt to do so here.
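The two Melting Pot additions can be illustrated with a small amount of arithmetic. The sketch below applies the team-mean shaping to one step of per-agent rewards and traces the spatial size through the three convolution layers. It assumes unpadded ("valid") convolutions, which is the usual Nature-CNN convention but is not stated in the text; the function names are hypothetical.

```python
def shaped_rewards(rewards, weight=0.1):
    """r~_i = r_i + weight * mean(r): adds a persona-agnostic team-mean term."""
    mean_r = sum(rewards) / len(rewards)
    return [r + weight * mean_r for r in rewards]


def conv_out(size, kernel, stride):
    """Output side length of an unpadded convolution (assumption: no padding)."""
    return (size - kernel) // stride + 1


def nature_cnn_spatial(size):
    """Side length after the 8/4 -> 4/2 -> 3/1 kernel/stride stack."""
    for kernel, stride in ((8, 4), (4, 2), (3, 1)):
        size = conv_out(size, kernel, stride)
    return size


# One step in a 7-player substrate where only agent 0 collected a reward.
step = shaped_rewards([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

side_88 = nature_cnn_spatial(88)  # 88 -> 21 -> 9 -> 7
side_40 = nature_cnn_spatial(40)  # 40 -> 9 -> 3 -> 1
flat_88 = 64 * side_88 * side_88  # features entering the final 256-d layer
```

Under this assumption the 88-pixel and 40-pixel substrates produce different flattened feature sizes, so the layer that maps to the 256-d feature is necessarily substrate-specific.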

We measure (i) mean pairwise action-KL between persona-conditioned policies on 64 random rollout states, averaged over the $\binom{10}{2} = 45$ persona pairs, and (ii) trajectory $\to$ persona retrieval top-3 accuracy with the trained GRU encoder over the 10-train vocabulary (chance $3/10 = 0.30$). Bootstrap 95% CIs use 10,000 resamples over seeds.
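Metric (i) can be sketched as follows. The text does not say whether the pairwise KL is directional or symmetrized, so this sketch symmetrizes it as an explicit assumption; the data layout (`policies[persona][state]` holding an action distribution) and the function names are also hypothetical.

```python
import math
from itertools import combinations


def kl(p, q, eps=1e-8):
    """KL(p || q) for two categorical distributions, smoothed by eps."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(policies):
    """Average symmetrized KL over all persona pairs and shared states.

    policies maps persona id -> list of action distributions, one per state,
    with the same state order for every persona.
    Returns (mean KL, number of persona pairs).
    """
    pair_values = []
    for a, b in combinations(sorted(policies), 2):
        per_state = [
            0.5 * (kl(pa, pb) + kl(pb, pa))
            for pa, pb in zip(policies[a], policies[b])
        ]
        pair_values.append(sum(per_state) / len(per_state))
    return sum(pair_values) / len(pair_values), len(pair_values)


# Ten personas with identical uniform 8-action policies on 64 states.
uniform = {pid: [[1 / 8] * 8 for _ in range(64)] for pid in range(10)}
mean_kl, n_pairs = mean_pairwise_kl(uniform)
```

A high value on this metric only says the policies differ; it does not say a trajectory can be attributed to its persona, which is why the retrieval metric is reported alongside it.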

TABLE VI: Melting Pot multi-substrate validation (Tab. 3). Final-checkpoint metrics at 1M environment steps per seed across three substrates. Mean pairwise action-KL (10 train personas, 45 pairs, 64 random states) and trajectory $\to$ persona retrieval on the 10-train vocabulary (chance top-1 $= 0.10$, top-3 $= 0.30$). The *no-InfoNCE* ablation collapses retrieval to chance in every substrate while leaving (or inflating) pairwise KL: the consistency loss is load-bearing across all three social-dilemma categories. Numbers are mean $\pm$ std over seeds; commons\_harvest\_\_open full has 5 seeds and a single ablation seed for backward compatibility; the two new substrates have 3 seeds per condition.

| Substrate (category) | Config | Seeds | Top-1 | Top-3 | Pair KL |
|---|---|---|---|---|---|
| commons\_harvest\_\_open (commons-pool) | Full pcsp | 5 | 0.564 ± 0.209 | 0.936 ± 0.073 | 2.11 ± 0.70 |
| | −InfoNCE | 1 | 0.071 | 0.250 | 5.82 |
| clean\_up (public-good) | Full pcsp | 3 | **0.690 ± 0.206** | **1.000 ± 0.000** | 1.01 ± 0.29 |
| | −InfoNCE | 3 | 0.095 ± 0.041 | 0.226 ± 0.055 | 2.91 ± 1.12 |
| prisoners\_dilemma\_…\_repeated (dyadic matrix-game) | Full pcsp | 3 | **0.625 ± 0.217** | **1.000 ± 0.000** | 2.87 ± 0.74 |
| | −InfoNCE | 3 | 0.125 ± 0.125 | 0.167 ± 0.144 | 4.54 ± 0.24 |
| chance (10 train personas) | - | - | 0.10 | 0.30 | 0 |

![Refer to caption](https://arxiv.org/html/2605.23652v1/x1.png) Figure 4: Melting Pot single-substrate detail on commons\_harvest\_\_open. Per-seed bars (blue: full pcsp, 5 seeds; red: −InfoNCE ablation, 1 seed). Dotted blue line: 5-seed mean. (a) Mean pairwise action-KL across the $\binom{10}{2} = 45$ persona pairs. (b) In-distribution trajectory $\to$ persona top-3 retrieval on the 10-train vocabulary (dashed grey: chance $3/10$). The ablation reaches the highest KL but the lowest top-3 (below chance): policies are different yet trajectories no longer encode *which* persona produced them. The two additional substrates in Tab. [VI](https://arxiv.org/html/2605.23652#S6.T6) replicate this pattern.

#### Takeaways.

The §[III](https://arxiv.org/html/2605.23652#S3) method, applied with no algorithmic modifications beyond the CNN observation encoder required by RGB inputs and (on commons\_harvest\_\_open) the persona-agnostic team-reward potential that escapes the apple-depletion trap, produces persona-distinguishable behavior across three Melting Pot substrates that differ in social structure (commons-pool / public-good / dyadic-matrix), observation geometry, and agent count (Table [VI](https://arxiv.org/html/2605.23652#S6.T6)). The no-InfoNCE ablation reproduces the v1–v3 finding in *every* substrate: pairwise KL is held or inflated ($+38\%$ on commons\_harvest, $+1.9\times$ on clean\_up, $+58\%$ on prisoners\_dilemma) while trajectory $\to$ persona retrieval collapses to chance (about 0.10 top-1 in all three substrates vs. 0.56–0.69 for the full objective). The consistency loss is what makes the policy *traceable to its conditioning*, not merely diverse, and this property is invariant across the three social-dilemma categories tested.

### VI-A Held-out persona recovery and cross-substrate transfer

The 12-persona corpus is split 10 train / 2 held-out (fast\_mover, spinner). The substrate-internal results above use the 10-train vocabulary. Here we ask two additional Layer-2 questions that the in-distribution table cannot: (i) does the InfoNCE-trained manifold extend to the held-out persona embeddings; and (ii) does the manifold learned on one substrate transfer to trajectories collected on a different substrate.

Protocol. For each Layer-2 checkpoint we run a 256-step rollout in the substrate’s environment, with every agent independently conditioned on a persona embedding drawn from the full 12-vocab. Per-agent trajectories are encoded by the trained GRU trajectory encoder and compared against the 12 *projected* persona embeddings via cosine similarity; chance top-1 is $1/12 \approx 0.083$ and chance top-3 is 0.25. Cross-substrate transfer between commons\_harvest\_\_open (CH) and clean\_up (CU) reuses the source substrate’s persona projection and trajectory encoder (zero-padded on the GRU input weight from 8 to 9 action one-hots when needed; no other parameter is touched) and applies them to trajectories collected by the target substrate’s policy in the target environment. Observation shapes match across CH and CU ($88\times88\times3$); prisoners\_dilemma is excluded from cross-substrate transfer because of its $40\times40$ RGB.
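The retrieval step of this protocol reduces to a cosine-similarity ranking. The sketch below assumes the trajectory and projected persona embeddings are already available as plain vectors; the trained GRU encoder and the projection are not reproduced, and the function names and toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def topk_accuracy(samples, persona_embs, k):
    """Fraction of (trajectory_embedding, true_persona_id) samples whose true
    persona is among the k most cosine-similar projected persona embeddings."""
    hits = 0
    for traj_emb, true_id in samples:
        ranked = sorted(
            persona_embs,
            key=lambda pid: cosine(traj_emb, persona_embs[pid]),
            reverse=True,
        )
        hits += true_id in ranked[:k]
    return hits / len(samples)


def chance_topk(k, vocab_size):
    """Chance top-k accuracy for a uniformly random ranking."""
    return k / vocab_size


# Toy 12-persona vocabulary with one-hot embeddings (illustrative only).
vocab = {pid: [1.0 if i == pid else 0.0 for i in range(12)] for pid in range(12)}
perfect = [(vocab[pid], pid) for pid in vocab]
acc_top1 = topk_accuracy(perfect, vocab, k=1)
```

With a 12-persona vocabulary `chance_topk(1, 12)` and `chance_topk(3, 12)` give the 0.083 and 0.25 baselines quoted above, and the 10-persona vocabulary gives 0.10 and 0.30.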

TABLE VII: Held-out-vocab and cross-substrate transfer (Tab. 4). *Top:* retrieval against the full 12-persona vocabulary (chance top-1 $= 0.083$, top-3 $= 0.25$); seeds match Tab. [VI](https://arxiv.org/html/2605.23652#S6.T6). *Bottom:* cross-substrate transfer over CH and CU full PCSP checkpoints (3 seed pairs, chance top-1 $= 0.10$, top-3 $= 0.30$). Mean $\pm$ std over seeds.

Results. Three things stand out (Tab. [VII](https://arxiv.org/html/2605.23652#S6.T7)). *(i) Full PCSP generalises to the 12-vocab.* Full-objective retrieval over the 12-persona vocabulary remains 3.4–4.9$\times$ chance for top-1 and 2.6–3.6$\times$ chance for top-3 in every substrate, and the no-InfoNCE ablation collapses to chance in every substrate (top-1 $\leq 0.071$). The mechanism is robust to vocabulary expansion. *(ii) Held-out persona recovery is the open problem.* Across all 11 full pcsp runs, the held-out top-1 is 0.000: the two held-out personas (fast\_mover, spinner) never retrieve themselves. This reproduces the embedding-margin condition documented in the companion technical report [[12](https://arxiv.org/html/2605.23652#bib.bib27)]: the held-out embeddings sit inside the convex hull of the train embeddings and the trained projection does not separate them. We flag this as the most direct remaining Layer-2 gap. *(iii) Cross-substrate transfer is asymmetric.* CU $\to$ CH yields top-1 $0.179 \pm 0.058$ ($1.79\times$ chance) and top-3 $0.429 \pm 0.077$ ($1.43\times$ chance), showing that the persona manifold learned on the richer clean\_up substrate carries information that survives transplantation into the CH environment. The reverse direction CH $\to$ CU is at-or-below chance for top-1 ($0.060 \pm 0.034$) but recovers a non-trivial top-3 signal ($0.417 \pm 0.034$, $1.39\times$ chance), indicating partial transfer at the rank-3 level. The asymmetry (a positive cross-task signal, but direction-dependent) is consistent with the InfoNCE manifold’s substrate-specific shaping; building a substrate-invariant projection is a clean follow-up.

## VII Layer 3 Results: Realtime Engine Deployment (UE5)

Layer 3 framing. A persona-conditioned policy is only meaningful if it survives the engineering pressure of a real game engine: asynchronous tick rates, NavMesh contention, BT failure recovery, ONNX runtime constraints, and shared world state. We deploy a *frozen* Layer-1 (pcsp-d-v3) checkpoint into Unreal Engine 5.5 as a hybrid intent stack (architecture and PIE screenshots in App. [B](https://arxiv.org/html/2605.23652#A2), Figs. [13](https://arxiv.org/html/2605.23652#A2.F13)–[15](https://arxiv.org/html/2605.23652#A2.F15)): pcsp selects semantic intents (e.g. EatQuick, FocusedWork, LeisureOutdoor) and the Behavior Tree, Blackboard, EQS, AIController, and NavMesh execute them. The policy is exported to ONNX (20-action head, 33-d obs, 64-d persona projection) and run through Epic’s NNE/NNERuntimeORT plugin from a custom UPCSPPolicySubsystem; persona embeddings are loaded once at level start from the same persona\_embeddings.json produced by the research pipeline, and the policy network itself is the unmodified pcsp checkpoint (no engine-side training). With this stack we ask three questions that Layers 1 and 2 cannot answer: (i) does the policy meet a realtime wall-budget at deployment scale; (ii) does the InfoNCE finding survive engine-side contention; (iii) do personas maintain identity over horizons far longer than the training episode.

Realtime envelope (summary). A single-workstation scaling sweep ($\{8, 16, 32, 64, 96, 128\}$ agents $\times$ 3 seeds $\times$ 630 s in standalone-game mode; full protocol and per-setting numbers in App. [D](https://arxiv.org/html/2605.23652#A4), Table [XIV](https://arxiv.org/html/2605.23652#A4.T14)) establishes that ONNX inference is *not* the bottleneck (mean per-call latency 183–202 µs through $n = 64$); frame time scales near-linearly at $\sim 0.27$ ms/agent; and the hard ceiling is NavMesh FindPath saturation, not policy inference (BT-abort rate $\leq 0.2\%$ for $n \leq 64$, 4.7% at $n = 96$, 44.9% at $n = 128$). We therefore fix $n = 64$ as the operating point for every Layer-3 finding below.

TABLE VIII: Held-out persona transfer (UE5, 64 agents, identical map). Train: personas 1–240; Held-out: 60 test personas (IDs 241–300) covering four occupations absent from training. $\rho_{\text{intra}}$ is inter-persona action-histogram Spearman.

Held-out persona generalization in-engine. We ran a clean zero-shot evaluation with the 60 held-out personas (IDs 241–300) inside UE5 after matching environment capacity to the held-out demand profile (Office, Hygiene; Table [VIII](https://arxiv.org/html/2605.23652#S7.T8)). Result: 2,792 interactions in 9.75 min, 0.04% failure rate (1 path-follow event of 2,792), 43.6 interactions per agent (vs. 33.5 on train personas), reward 960.4 (vs. 730.6), category coverage 9/10 in both. The inter-persona action-histogram correlation is $\rho = 0.368$ on held-out personas vs. 0.383 on train, where a lower $\rho$ means more dispersed, more persona-distinct behavior: the engine-integrated policy is *slightly more* persona-distinct on personas it has never seen, matching the pcsp-d-v3 zero-shot finding qualitatively.

TABLE IX: In-engine ablation at 64 agents, $\sim$5 min per run (pcsp.PolicyMode CVar, identical map and persona set). $n_{\text{int}}$: interaction\_complete events. Fail: move\_failed / total decisions. $\rho_{\text{intra}}$: mean Spearman over $\binom{64}{2}$ persona pairs on action histograms. Sym. KL is against the reference HybridPCSP run. † BTOnly emits zero logits by construction; KL against HybridPCSP would just measure distance from uniform.

![Refer to caption](https://arxiv.org/html/2605.23652v1/x2.png) Figure 5: Phase 4 runtime ablation, 64 agents. Zeroing the persona embedding (HybridNoPersona) collapses inter-persona action dispersion from $\rho = 0.37$ to $\rho = 0.99$. Replacing the ONNX policy with the BT-only needs heuristic (BTOnly) further halves throughput and reward as 64 agents synchronise on the single most-urgent need each tick.

In-engine ablation. We re-ran the persona-conditioning ablation through a runtime CVar (pcsp.PolicyMode) at 64-agent scale (5 min each, Table [IX](https://arxiv.org/html/2605.23652#S7.T9), Fig. [5](https://arxiv.org/html/2605.23652#S7.F5)): full pcsp achieved 2,077 interactions / 0.0% failure / reward 708.9 / inter-persona $\rho = 0.368$; zeroing the persona vector (HybridNoPersona, the inference-time analogue of RL-only) raised $\rho$ to 0.990, meaning action histograms become nearly identical across personas, increased failure to 13.3%, and yielded a symmetric KL of 1.05 against the full-PCSP policy distribution; bypassing ONNX entirely (BTOnly, needs heuristic) collapsed to 1,152 interactions / 87.6% failure / $\rho = 0.989$. Persona conditioning is load-bearing in the engine under the same metric: inter-persona $\rho$ rises by $2.7\times$ when the persona signal is removed, i.e. action dispersion collapses, reproducing the §[V](https://arxiv.org/html/2605.23652#S5) pattern with no algorithmic change. To calibrate the KL axis, two back-to-back PCSP runs on the same map and persona set yielded a per-persona symmetric KL of 0.71 nats (median 0.48, Spearman $\rho = 0.65$), so the HybridNoPersona KL of 1.05 is already $1.5\times$ this paired-run noise floor.
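The $\rho$ statistic used throughout this ablation is described as a mean Spearman correlation over persona pairs on action histograms. A minimal standard-library sketch of that computation is given below, assuming one action-count histogram per persona; the tie handling (average ranks), the zero-variance convention, and the function names are choices made for this sketch, not details confirmed by the text.

```python
import math
from itertools import combinations


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(histograms):
    """Mean Spearman over all unordered pairs of per-persona action histograms."""
    pairs = list(combinations(histograms, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Three personas with the same action preferences give rho close to 1.
collapsed = mean_pairwise_spearman([[5, 3, 1, 0], [50, 30, 10, 0], [9, 4, 2, 1]])
```

Read this way, a value near 1 means every persona ranks the actions in the same order, which is the signature of the HybridNoPersona and BTOnly runs, while lower values indicate persona-specific action preferences.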

Consistency-loss replication in-engine. We additionally exported the v3 no\_consist checkpoint to ONNX and ran paired 64-agent PIE sessions under identical HybridPCSP mode but with the two different weight files. Full reached 3,110 interactions in 658 s with reward 1,079.8 and inter-persona action $\rho = 0.379$; no\_consist reached 4,005 interactions in 681 s with reward 1,423.5 and $\rho = 0.312$. The two checkpoints choose meaningfully different actions per persona at matched conditioning (mean Spearman $\rho_{\text{match}} = 0.348$ and mean symmetric KL of 1.79 nats across the 64 personas), yet aggregate task reward is *higher* for the checkpoint without the consistency loss. This is the in-engine replication of the §[V](https://arxiv.org/html/2605.23652#S5) “reward hides the failure” result.

### VII-A Long-horizon behavioural persistence

Do personas hold their routines past the training horizon? The Layer-1 training episode is 128 environment steps (intent decisions). To test whether persona identity persists over horizons longer than the policy saw at training time, and under a contention level low enough that the signal is not masked by zone-capacity collisions (cf. §[VII-B](https://arxiv.org/html/2605.23652#S7.SS2)), we ran a dedicated low-density session: 8 agents in standalone `-game` for `pcsp.RunDurationSeconds=1800` (`ue/cnzoi/Saved/PCSP/Logs/20260520_102022`, 1,794 s, 0.0% BT-abort). The eight agents are pinned to four personas (two agents each) chosen *a priori* to maximise pairwise category-distribution sym-KL on an independent clean reference session (`20260518_114841`; minimum pairwise sym-KL among the four $= 2.68$ nats): `p001` (Social-leaning), `p009` (Rest-leaning), `p041` (high-entropy), and `p058` (Work-leaning). Each agent issues 119–291 sequential intent decisions over the window ($0.9$–$2.3\times$ the 128-step training episode), spanning 30 minutes of wall-clock realtime; the least active agents therefore stay just under the training horizon, while the most active exceed it by more than a factor of two. For each persona we bin every `decision`/`interaction_complete` event into 1-minute windows and report the dominant intent category per bin (Fig. [6](https://arxiv.org/html/2605.23652#S7.F6), derived by `research/scripts/build_persona_persistence.py`; raw bin sequence and per-bin category histograms in `research/results/ue_sessions/20260520_102022/persona_persistence.json`).
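
The binning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' `build_persona_persistence.py`: the `(timestamp, category)` event layout and all function names are assumptions, and the top-category share is read here as the fraction of bins whose modal category equals the overall modal category (which matches the 17/30 and 15/30 figures reported for `p058` and `p001`).

```python
from collections import Counter


def dominant_category_per_bin(events, bin_seconds=60.0, n_bins=30):
    """Return the modal category of each time bin (None for empty bins).

    events: iterable of (timestamp_seconds, category) pairs for one persona.
    The event layout is hypothetical; the real trace schema is in App. C.
    """
    counts = [Counter() for _ in range(n_bins)]
    for t, category in events:
        idx = int(t // bin_seconds)
        if 0 <= idx < n_bins:
            counts[idx][category] += 1
    return [c.most_common(1)[0][0] if c else None for c in counts]


def top_category_share(modal_bins):
    """Most frequent modal category and the fraction of non-empty bins it wins."""
    filled = [m for m in modal_bins if m is not None]
    if not filled:
        return None, 0.0
    category, n = Counter(filled).most_common(1)[0]
    return category, n / len(filled)


def count_transitions(modal_bins):
    """Number of consecutive non-empty bins whose modal category differs."""
    filled = [m for m in modal_bins if m is not None]
    return sum(1 for a, b in zip(filled, filled[1:]) if a != b)


if __name__ == "__main__":
    toy = [(5.0, "Rest"), (20.0, "Rest"), (45.0, "Work"), (70.0, "Work"), (130.0, "Rest")]
    bins = dominant_category_per_bin(toy, n_bins=3)
    print(bins, top_category_share(bins), count_transitions(bins))
```

Under this reading, a persona such as `p009` that stays in one category yields a share of 1.0 and zero transitions, while a persona that cycles through several categories yields a lower share and a higher transition count.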

![Refer to caption](https://arxiv.org/html/2605.23652v1/x3.png)

Figure 6: Long-horizon persona persistence (Layer 3). Per-minute dominant intent category for four maximally-separated personas over 30 in-game minutes in UE5 (8 agents, *frozen* Layer-1 checkpoint, 0.5 s decision throttle, 0.0% BT-abort). Each row is one persona; each column is a 1-minute bin coloured by the modal intent category. `p009` stays in Rest for all 30 bins (single run); `p058` holds Work as its modal axis in 17/30 bins; `p001` is Social-dominant (15/30 bins) while cycling through five categories; `p041` splits evenly between Social and Work. No persona ever “forgets” its dominant axis over the 30-minute window.

Three observations. (i) *Persona-level stability holds past the training horizon.* The top-category share over the 30-minute window is 1.00 / 0.57 / 0.50 / 0.47 for `p009` / `p058` / `p001` / `p041` respectively: the persona ordering by “focus” is preserved end-to-end, and the most-focused persona (`p009`, 291 decisions/agent) stays on its preferred axis for every single bin. (ii) *Persistence is persona-specific, not flat.* `p001` visits five distinct categories with 18 inter-bin transitions; `p009` executes a single 30-bin Rest run. The figure therefore shows persona persistence *without* flattening into “every persona does its top thing forever”, which would be indistinguishable from a learned task-side bias. (iii) *The mechanism is the conditioning, not a recurrent state.* The policy is stateless feed-forward over the per-step observation; the only persona-side memory is the frozen 64-d projection. Behavioural persistence over 30 minutes of realtime is thus attributable to the persona embedding plus the InfoNCE-trained conditioning manifold, not to any in-engine temporal smoothing.

### VII-B Failure analysis and contention: the $\rho$-drop as a Layer-3 finding

BT failure taxonomy. The hybrid stack distinguishes three engine-side failure modes, all emitted as Behaviour-Tree aborts that the offline analyser bins by `failure_reason` (Table [X](https://arxiv.org/html/2605.23652#S7.T10)). Across nine 64-agent PIE sessions (including both held-out and runtime ablation conditions), `FindBestZone:AllOverCapacity` (every candidate zone for the chosen intent is already at capacity) accounts for the dominant share of aborts in contention-heavy regimes (e.g. 87.6% of decisions in the BTOnly run, where 64 agents synchronise on the most-urgent need each tick). Under the full pcsp-d-conditioned stack with tuned capacities this collapses to $< 2\%$ of decisions, with the residual being `zone_no_free_interaction_point` (a chosen zone fills between selection and arrival) and `path_follow_idle_short` (a NavMesh follower stalls below a movement threshold). Critically, the `HybridNoPersona` ablation raises the abort rate from 0.0% to 13.3% *without changing the map*: the persona signal is what desynchronises agents into different categories and keeps the contention budget liveable.
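
As a rough illustration of how aborts could be binned by `failure_reason`, the sketch below counts reasons over per-decision JSONL lines. Only the `failure_reason` field name comes from the text; the one-object-per-decision layout, the convention that a missing or empty value means "no abort", and the function name are assumptions rather than the behaviour of the authors' analyser.

```python
import json
from collections import Counter


def abort_taxonomy(jsonl_lines):
    """Count decisions and aborts per failure_reason.

    Assumes one JSON object per decision; a missing, null, or empty
    "failure_reason" is treated as a decision that did not abort.
    Returns (total_decisions, {reason: count}, abort_rate).
    """
    total = 0
    reasons = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the trace file
        record = json.loads(line)
        total += 1
        reason = record.get("failure_reason")
        if reason:
            reasons[reason] += 1
    aborts = sum(reasons.values())
    rate = aborts / total if total else 0.0
    return total, dict(reasons), rate


if __name__ == "__main__":
    toy_trace = [
        '{"agent": 1, "failure_reason": null}',
        '{"agent": 2, "failure_reason": "FindBestZone:AllOverCapacity"}',
        '{"agent": 3}',
        '{"agent": 4, "failure_reason": "path_follow_idle_short"}',
    ]
    print(abort_taxonomy(toy_trace))
```

Comparing the returned rate between a full-policy session and a `HybridNoPersona` session is the kind of check behind the 0.0% versus 13.3% contrast quoted above.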

Where the contention lands. Figure [7](https://arxiv.org/html/2605.23652#S7.F7) traces zone-capacity utilisation (occupants/capacity) across a representative 64-agent HybridPCSP session (3,881 decisions, 3,728 completed interactions over 629 s). Utilisation is non-uniform: the Rest, Work, Hygiene, and Exercise zones peak near 0.65 and Rest sits at a 0.51 time-mean, while other zones stay slack. This unevenness is the engine-side mechanism that compresses the persona signal. Figure [8](https://arxiv.org/html/2605.23652#S7.F8) contrasts the policy’s *preferred* intent distribution (mean softmax over the 20-d logits, folded to the 11 affordance categories) with what is *expressed* (the categories whose interactions actually complete). The two distributions diverge sharply (symmetric KL $= 9.07$ nats): the policy’s largest preferences, Leisure (0.34) and Study (0.29), collapse at execution (0.00 and 0.04), while the reachable Rest, Social, and Work categories absorb the displaced mass ($0.03 \to 0.48$, $0.11 \to 0.31$, $0.03 \to 0.11$). This is the same “pushed toward what is reachable” effect made concrete: the BT’s failure-recovery decorators reroute an unsatisfiable intent to whichever affordance is free, attenuating the embedding-to-action pathway that Layer 1 measures in isolation.
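
The preferred-versus-expressed comparison can be sketched as follows. The action-to-category mapping, the additive smoothing constant, and the choice of KL(p‖q) + KL(q‖p) as the "symmetric KL" are all assumptions made for illustration; the paper does not spell them out here, and because the expressed distribution contains near-zero entries the result is sensitive to the smoothing, so this sketch is not expected to reproduce the reported 9.07 nats.

```python
import math
from collections import Counter


def fold_to_categories(action_probs, action_to_category, categories):
    """Sum per-action probabilities into per-category probabilities."""
    folded = {c: 0.0 for c in categories}
    for action, p in enumerate(action_probs):
        folded[action_to_category[action]] += p
    return [folded[c] for c in categories]


def expressed_distribution(completed_categories, categories):
    """Empirical distribution over the categories of completed interactions."""
    counts = Counter(completed_categories)
    total = sum(counts.values())
    return [counts[c] / total for c in categories]


def symmetric_kl(p, q, eps=1e-6):
    """KL(p||q) + KL(q||p) in nats; eps smoothing keeps zero entries finite."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    p = [x / zp for x in p]
    q = [x / zq for x in q]
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


if __name__ == "__main__":
    categories = ["Rest", "Work", "Leisure"]        # toy subset, not the 11 real ones
    mapping = ["Rest", "Work", "Leisure", "Leisure"]  # hypothetical 4-action ontology
    preferred = fold_to_categories([0.1, 0.2, 0.3, 0.4], mapping, categories)
    expressed = expressed_distribution(["Rest", "Rest", "Work", "Rest"], categories)
    print(preferred, expressed, symmetric_kl(preferred, expressed))
```

A category that the policy prefers but that never completes (Leisure in the toy data) dominates the divergence, which is the qualitative pattern the paragraph describes.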

TABLE X: BT-abort taxonomy at 64 agents. Decision counts and abort categories for representative sessions; `FindBestZone` is the contention-induced “every candidate zone is full” abort, `zone_no_free_interaction_point` is the selection-to-arrival race, `path_follow_idle_short` is a NavMesh follower stall. The same persona-removal ablation that inflates dispersion to $\rho = 0.99$ also moves the failure mass from $\sim 0\%$ into `FindBestZone`-class aborts, naming contention as the engine-side mechanism behind the $\rho$-drop.

![Refer to caption](https://arxiv.org/html/2605.23652v1/x4.png)

Figure 7: Zone-capacity utilisation over a 64-agent HybridPCSP episode. Per-zone occupants/capacity (rows) across 30 equal time bins over the $\sim 10.5$-min episode (columns). Rest, Work, Hygiene, and Exercise zones carry the load (peaks $\approx 0.65$); the unevenness is the engine-side contention behind the $\rho$-drop.

![Refer to caption](https://arxiv.org/html/2605.23652v1/x5.png)

Figure 8: Expressed vs. preferred intent under engine contention. Policy-preferred category distribution (mean softmax over the 20-d logits, folded to 11 categories) against the categories whose interactions actually complete, aggregated over 64 agents. The policy’s top preferences (Leisure, Study) collapse at execution while Rest/Social/Work absorb the displaced mass (symmetric KL $= 9.07$ nats): agents are rerouted to reachable affordances.

Engine-side persona–action alignment. We additionally test the pcsp-d alignment metric inside the engine: for each session we compute pairwise cosine distance over the 64-d projected persona embedding and pairwise symmetric KL over the per-persona mean softmax(logits), and report Spearman $\rho$ across the $\binom{64}{2} = 2{,}016$ persona pairs. Full PCSP yields engine-side $\rho \in [0.236, 0.257]$ across two paired sessions, the `no_consist` checkpoint yields $\rho = 0.569$, and `BTOnly` yields $\rho \approx 0$ (the policy emits zero logits by construction). All four values are well below the pcsp-d research-side $\rho \approx 0.73$. The hybrid stack itself imposes a ceiling: capacity contention and the BT’s failure-recovery decorators push agents toward what is reachable, not what their embedding most prefers, compressing the persona signal at execution time. We treat this gap as a quantified engine–research mismatch, not a refutation of the in-research result.
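
The alignment metric described above (pairwise embedding distance against pairwise behavioural divergence, summarised by a rank correlation) can be sketched with the standard library alone. The dictionary inputs, the smoothing constant, the sum-of-both-directions form of the symmetric KL, and the NaN convention for a constant side are assumptions; in the degenerate `BTOnly`-like case where every persona has the same action distribution this sketch returns NaN, and how the authors arrive at $\rho \approx 0$ there is not stated.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def symmetric_kl(p, q, eps=1e-6):
    """KL(p||q) + KL(q||p) in nats, with additive smoothing (an assumption)."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    p = [x / zp for x in p]
    q = [x / zq for x in q]
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return math.nan  # undefined when one side is constant
    return cov / math.sqrt(var_x * var_y)


def persona_action_alignment(embeddings, mean_softmax):
    """Return (Spearman rho, number of persona pairs).

    embeddings, mean_softmax: dicts keyed by persona id (hypothetical layout).
    """
    ids = sorted(embeddings)
    emb_dist, act_dist = [], []
    for a, b in combinations(ids, 2):
        emb_dist.append(cosine_distance(embeddings[a], embeddings[b]))
        act_dist.append(symmetric_kl(mean_softmax[a], mean_softmax[b]))
    return spearman(emb_dist, act_dist), len(emb_dist)
```

With 64 personas the pair loop visits 2,016 pairs, matching the count in the text; a rank correlation is used so that only the ordering of distances matters, not their scale.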

Per-decision JSONL traces, the live Gameplay Debugger overlay (Fig. [15](https://arxiv.org/html/2605.23652#A2.F15)), and a $\sim 2{,}000$-line C++ integration make the consistency-loss ablation re-runnable inside the engine without re-instrumenting; full observability instrumentation, trace schema, and implementation-cost breakdown are described in App. [C](https://arxiv.org/html/2605.23652#A3).

### VII-C Emergent social structure

![Refer to caption](https://arxiv.org/html/2605.23652v1/x6.png)

Figure 9: Co-interaction graph over the 64-agent HybridPCSP episode. Nodes are agents, coloured by behavioural archetype (modal expressed category); an edge joins two agents whenever their zone-occupancy intervals overlap *and* their interaction points lie within 250 world units (same/adjacent seat). Edge opacity and width scale with total shared-zone overlap; node size scales with weighted degree. Same-archetype agents co-locate well above chance (assortativity 0.36, 63.5% of edges same-archetype).

The hybrid stack records no explicit social interactions, yet a co-presence structure emerges purely from 64 agents independently choosing where to spend time under a shared world. From the per-agent traces we reconstruct each agent’s zone-occupancy intervals (a `decision` $\to$ `interaction_complete` pair fixes a zone category, a time window, and an interaction-point position) and draw a co-interaction edge between two agents whenever their intervals overlap in the same zone *and* their interaction points fall within 250 world units. This is a genuine shared-seat criterion rather than mere co-membership of a large zone, which a single 36-capacity zone would otherwise saturate into a near-complete graph. The resulting graph over 3,728 completed interactions (Fig. [9](https://arxiv.org/html/2605.23652#S7.F9)) has 672 edges (density 0.33, mean weighted degree 21) and is *archetype-assortative*: the attribute assortativity coefficient is 0.36 and 63.5% of edges join agents of the same modal archetype, far above the mixing expected if routine choice were persona-independent. The effect is robust to the co-location threshold and strengthens monotonically as the criterion tightens: assortativity rises $0.13 \to 0.26 \to 0.36$ as the radius contracts $600 \to 400 \to 250$ units, confirming that the structure is driven by genuine physical co-location of like-persona agents, not by coarse zone coincidence. No social objective, reward term, or coordination signal is present; the clustering is an emergent consequence of persona-conditioned activity selection expressed through the shared engine, and it is a deployment-side analysis that Layers 1 and 2 cannot produce.
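
The edge criterion and the assortativity statistic can be sketched as below. The visit tuple layout, 2-D interaction-point coordinates, and function names are hypothetical, and the assortativity here is Newman's attribute coefficient over an unweighted, undirected edge set; whether the authors weight edges by overlap when computing it is not stated.

```python
import math
from itertools import combinations


def co_interaction_edges(visits, radius=250.0):
    """Build {(a, b): total_overlap_seconds} for agent pairs with a < b.

    visits: list of (agent, zone, t_start, t_end, (x, y)) tuples.
    An edge needs the same zone, overlapping time windows, and interaction
    points no further apart than `radius` world units.
    """
    edges = {}
    for v, w in combinations(visits, 2):
        (a, zone_a, start_a, end_a, pos_a) = v
        (b, zone_b, start_b, end_b, pos_b) = w
        if a == b or zone_a != zone_b:
            continue
        overlap = min(end_a, end_b) - max(start_a, start_b)
        if overlap <= 0 or math.dist(pos_a, pos_b) > radius:
            continue
        key = (a, b) if a < b else (b, a)
        edges[key] = edges.get(key, 0.0) + overlap
    return edges


def same_archetype_fraction(edges, archetype):
    """Share of edges whose endpoints carry the same archetype label."""
    if not edges:
        return 0.0
    same = sum(1 for a, b in edges if archetype[a] == archetype[b])
    return same / len(edges)


def attribute_assortativity(edges, archetype):
    """Newman's attribute assortativity for an undirected, unweighted graph."""
    labels = sorted(set(archetype.values()))
    index = {label: i for i, label in enumerate(labels)}
    k = len(labels)
    mix = [[0.0] * k for _ in range(k)]
    for a, b in edges:
        i, j = index[archetype[a]], index[archetype[b]]
        mix[i][j] += 1.0
        mix[j][i] += 1.0  # count each undirected edge in both directions
    total = sum(sum(row) for row in mix)
    if total == 0.0:
        return math.nan
    trace = sum(mix[i][i] for i in range(k)) / total
    marginals = [sum(row) / total for row in mix]
    chance = sum(m * m for m in marginals)
    if chance == 1.0:
        return math.nan  # every endpoint has the same label
    return (trace - chance) / (1.0 - chance)
```

Re-running `co_interaction_edges` with a radius of 600, 400, and 250 and recomputing the coefficient each time is the shape of the robustness check reported in the paragraph above.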

## VIII Discussion and Future Work

Dynamic personas and memory. Current pcsp encodes a static persona text once at NPC creation. Life-simulation characters should also change: a repeated social conflict may increase neuroticism, a long friendship may shift agreeableness, and a remembered event may alter future action preferences. A natural extension is to separate stable persona traits from dynamic memory state, using a lightweight event-conditioned update model or retrieval-augmented memory module that can influence the shared policy without requiring per-step LLM inference \[[19](https://arxiv.org/html/2605.23652#bib.bib8)\]. Persona-grounded dialog corpora \[[38](https://arxiv.org/html/2605.23652#bib.bib42)\] provide a complementary signal for evolving personas from social interaction.

Scaling to richer environments. The Melting Pot evidence in §[VI](https://arxiv.org/html/2605.23652#S6) now spans three substrates across distinct social-dilemma categories and includes both held-out-vocabulary retrieval and cross-substrate transfer (§[VI-A](https://arxiv.org/html/2605.23652#S6.SS1), Tab. [VII](https://arxiv.org/html/2605.23652#S6.T7)). Two open problems remain. *Held-out persona recovery* fails consistently: across 11 full pcsp runs, neither held-out persona is retrieved at rank 1, reproducing the embedding-margin condition documented in the companion technical report \[[12](https://arxiv.org/html/2605.23652#bib.bib27)\]. The natural mechanistic targets are multi-objective InfoNCE with per-cluster pinning, projection-head re-seeding, and a controlled-margin held-out corpus. *Cross-substrate transfer* is positive but asymmetric (CU $\to$ CH $1.79\times$ chance top-1; CH $\to$ CU at-or-below chance top-1 but $1.39\times$ chance top-3), pointing to a substrate-invariant projection as the next architectural target. The UE5 deployment in §[VII](https://arxiv.org/html/2605.23652#S7) shows that the sub-frame inference profile and the persona-conditioning ablation both survive the move out of the Python benchmark; what remains is commercial-scale multi-agent validation in worlds substantially richer than a single district, including procedurally-generated suites \[[6](https://arxiv.org/html/2605.23652#bib.bib36)\] that stress persona-policy generalization beyond a fixed training map.

Human evaluation methodology. Automatic metrics are necessary for ablations but insufficient for perceived NPC believability. Our 30-participant coarse-trace pilot shows above-chance persona identification, but also many ambiguous or misleading items. Stronger human evaluation will require an instrument that preserves per-participant responses, so that variance, confidence, and inter-rater agreement can be analyzed alongside aggregate accuracy.

## IX Limitations and Scope

pcsp has several limitations that bound the interpretation of the results.

Deliberate minimality of Layer 1. pcsp-d is intentionally simpler than commercial life-simulation worlds. Behavioral realism is not its purpose; it is the layer at which we can run controlled InfoNCE ablations across three independent environment instantiations and thousands of held-out personas. Realism claims in this paper are grounded in Layers 2 (Melting Pot, §[VI](https://arxiv.org/html/2605.23652#S6)) and 3 (UE5, §[VII](https://arxiv.org/html/2605.23652#S7)), where realism, contention, and asynchrony are present and where the same checkpoint is shown to behave consistently. Commercial-scale multi-agent validation, continuous physics, longer horizons, asynchronous events, and richer social affordances beyond Layer 3 remain untested.

Action observability. The original 12-action space exposes only a thin slice of persona-relevant behavior. In the 30-participant coarse-trace pilot, aggregate 2AFC accuracy was above chance, but 13 of 30 items were ambiguous or misleading. Coarse labels such as `read`, `rest`, and `eat` are difficult to interpret without temporal, spatial, and social context. pcsp-d-v3 and the rich rollout renderer address this partly, but finer social and stylistic action semantics remain needed.

Synthetic personas. Training personas are generated from Big Five archetypes and occupation templates. The designer-authored case study in §[V-E](https://arxiv.org/html/2605.23652#S5.SS5) tests 50 additional handwritten personas across five sources, but does not establish robustness to the nuance, inconsistency, or cultural specificity of production-authored characters.

Limited aggregate human evaluation. The completed Google Forms pilot collected only item-level A/B selection ratios for coarse traces. It supports aggregate 2AFC accuracy analysis, but not participant-level variance, confidence, response time, order effects, or inter-rater reliability. A more instrumented human study remains future work.

Engine–research alignment gap (analysis in §[VII-B](https://arxiv.org/html/2605.23652#S7.SS2)). We do not treat the Layer-1 $\to$ Layer-3 drop in absolute alignment ($\rho \approx 0.73$ in pcsp-d vs. $\rho \in [0.236, 0.257]$ in UE5) as a method failure. The persona-conditioning ablation and the InfoNCE consistency-loss replication both remain visible *inside the engine* (§[VII-B](https://arxiv.org/html/2605.23652#S7.SS2)); the structural cause is engine-side contention, not loss of method validity. What remains open is whether a less congested map narrows the absolute gap, and how to apportion the residual between hybrid-stack ceiling and intrinsic deployment-noise.

## X Conclusion

Life simulation games create a scaling challenge that the game AI field has not yet systematically addressed: how to give hundreds to thousands of NPCs distinct, consistent, and controllable personalities without proportional authoring or inference cost. We show that *persona-conditioned shared policies* (a single RL policy conditioned on frozen LLM persona embeddings) can satisfy the four axes that matter for practical deployment: persona consistency, natural-language controllability, zero-shot generalization, and real-time inference.

Across three pcsp-d settings, pcsp achieves up to $17\times$ above-chance *compositional* zero-shot persona identification (unseen-occupation held-out crosses), $\rho \approx 0.73$ semantic-behavioral alignment, and $22\times$ faster inference than an LLM-as-policy baseline. We separate this regime from *vocabulary-expansion* held-out evaluation in Melting Pot (§[VI-A](https://arxiv.org/html/2605.23652#S6.SS1)), where top-1 retrieval on the two genuinely held-out persona tokens remains at 0 across 11 runs and which we flag as the most direct open problem. External validation on three Melting Pot 2.4.0 substrates (`commons_harvest__open`, `clean_up`, `prisoners_dilemma_in_the_matrix__repeated`) confirms that the same method produces persona-conditioned behavioral divergence across commons-pool, public-good, and dyadic-matrix social dilemmas, with the no-InfoNCE ablation collapsing trajectory $\to$ persona retrieval to chance in every substrate. The main empirical finding is that the InfoNCE trajectory-consistency objective is load-bearing: removing it collapses zero-shot persona traceability to chance even when reward improves. The next steps are to couple static personas with dynamic memory, test cross-substrate persona transfer and held-out-persona recovery against the mechanistic targets identified in the companion technical report \[[12](https://arxiv.org/html/2605.23652#bib.bib27)\], and complete human studies using rich trajectory traces.

## Appendix A Layer 1 base-scale (v1), expanded-scale (v2), and v3-large replications

The v3 results in §[V](https://arxiv.org/html/2605.23652#S5) are the primary Layer-1 evidence. We include here the v1 (base-scale, 12-action), v2 (expanded-scale, 12-action), and v3-large (expanded-scale, 20-action) replications that establish the InfoNCE consistency objective as load-bearing across grid size, agent count, persona-set size, and action ontology. The protocols are described in the Experimental Setup paragraphs of §[V](https://arxiv.org/html/2605.23652#S5); metrics, baselines, and chance levels are unchanged. v3-large is the only one of the three with 3-seed aggregation; v1 and v2 report point estimates from a single seed.

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/fig2_learning_v1v2.png)

Figure 10: Training reward curves at two scales. (a) v1 ($6 \times 6$, 4 agents, 300 iters). (b) v2 ($12 \times 12$, 16 agents, 200 iters). Removing the consistency loss (`no_consist`) preserves or even slightly increases reward but collapses zero-shot persona traceability across all scales (Tables [XI](https://arxiv.org/html/2605.23652#A1.T11)–[III](https://arxiv.org/html/2605.23652#S5.T3)). Removing the diversity loss (`no_diverse`) causes KL collapse at v1 and v2 (0.39 and 0.48) and a v1 reward drop, but its effect on v3 zero-shot accuracy is within the Wilson CI of full (§[V](https://arxiv.org/html/2605.23652#S5)).

TABLE XI: v1 results ($6 \times 6$, 4 agents, 300 personas). ZS Acc: zero-shot $k$-NN accuracy on 60 unseen personas (random $\approx 1.7\%$). $\rho$: Spearman (projected persona dist. vs. KL). \* Near-random (failure mode). † KL $\approx 0.39$; $\rho$ not meaningful.

TABLE XII: v2 results ($12 \times 12$, 16 agents, 500 personas). ZS Acc on 100 unseen personas (random $= 1.0\%$). Lower absolute reward reflects the harder, larger environment. \* Exactly 0 (total failure). † KL too low; $\rho$ not meaningful. ‡ High KL without consistency reflects unstructured divergence, not persona alignment.

TABLE XIII: v3-large results ($12 \times 12$, 16 agents, 500 personas, 20-action ontology; 400 train / 100 zero-shot test). 3 seeds per arm (mean $\pm$ std). ZS Acc on 100 unseen personas (random $= 1.0\%$). Coh.: trajectory coherence ratio (intra/inter persona cosine similarity). \* At random chance (0.010). The consistency loss is held at zero during training, so trajectory and persona embeddings receive no alignment signal; coherence ratio collapses to 1.05 (intra $\approx$ inter), reproducing the v3-base finding at $4\times$ grid, $4\times$ agents, $5/3\times$ personas, under the 20-action ontology.
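
One plausible reading of the coherence ratio in the caption above is the mean cosine similarity between trajectory embeddings of the same persona divided by the mean similarity between embeddings of different personas. The sketch below implements that reading; the input layout and the use of unweighted means over all pairs are assumptions, since the exact definition is given elsewhere in the paper.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def coherence_ratio(trajectory_embeddings):
    """Mean intra-persona similarity divided by mean inter-persona similarity.

    trajectory_embeddings: dict persona_id -> list of embedding vectors
    (hypothetical layout). Needs at least one same-persona pair and one
    cross-persona pair.
    """
    items = [
        (persona, vec)
        for persona, vecs in trajectory_embeddings.items()
        for vec in vecs
    ]
    intra, inter = [], []
    for (pa, u), (pb, v) in combinations(items, 2):
        (intra if pa == pb else inter).append(cosine_similarity(u, v))
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


if __name__ == "__main__":
    toy = {"p1": [[1.0, 0.0], [1.0, 0.0]], "p2": [[1.0, 1.0], [1.0, 1.0]]}
    print(coherence_ratio(toy))
```

Under this reading a ratio near 1 means trajectories of the same persona are no more alike than trajectories of different personas, which is how the 1.05 value for the no-consistency arm is interpreted in the caption.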

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/fig4_zeroshot.png)

Figure 11: Zero-shot generalization on 60 unseen personas (v1). pcsp (full) achieves 19.3% trajectory-to-persona identification; the red dashed line marks random chance (1.7%). Removing the consistency loss collapses accuracy to 1.7%. The v2 experiment (100 unseen personas) replicates this qualitative pattern.

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/fig3_kl_v1v2.png)

Figure 12: Empirical projected persona distance vs. behavioral KL divergence at two scales (PCSP full). Each point is a recomputed persona pair from the trained policy checkpoints rather than a synthetic reconstruction from summary statistics. Left: v1 sampled-pair $\rho = 0.731$ (100 pairs, 200 states). Right: v2 sampled-pair $\rho = 0.695$ (60 pairs, 100 states). The monotone alignment between persona space and behavior space is preserved when moving to a $4\times$ larger grid with $4\times$ more agents and 67% more personas.

## Appendix B UE5 System Architecture

This appendix collects the engine-integration scaffolding for the Layer-3 deployment whose *scientific* findings are reported in §[VII](https://arxiv.org/html/2605.23652#S7). Figure [13](https://arxiv.org/html/2605.23652#A2.F13) shows the research / engine split: the Python pipeline trains pcsp and exports a single ONNX actor plus a JSON of L2-normalised persona projections; Unreal loads both into `UPCSPPolicySubsystem` and runs synchronous inference inside a Blackboard-driven Behaviour Tree; per-decision JSONL traces feed the offline analysis scripts. No training happens engine-side. Figure [14](https://arxiv.org/html/2605.23652#A2.F14) shows `Map_PCSPDistrict_M` in PIE with the affordance-zone layout used throughout the Layer-3 evaluation, and Fig. [15](https://arxiv.org/html/2605.23652#A2.F15) shows the live Gameplay Debugger overlay that exposes per-agent Blackboard and BT state at runtime.

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/Research-UE5-split.png)

Figure 13: Research–UE5 split. The Python pipeline (left) trains pcsp and exports a single ONNX actor plus a JSON of L2-normalised persona projections. Unreal (centre) loads both into `UPCSPPolicySubsystem` and runs synchronous inference inside a Blackboard-driven Behaviour Tree; per-decision JSONL traces feed the offline analysis scripts (right). No engine-side training.

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/ue5-topdown.png)

Figure 14: `Map_PCSPDistrict_M` in PIE. Top-down view with affordance-zone labels (SocialHub, Library/Study, Office/Work, Bedroom/Rest, …); each green sphere is an `APCSPAgentCharacter` executing the hybrid PCSP/BT stack. The map has 10 zones spanning the v3 affordance taxonomy (Kitchen, Bedroom, Office, Library, Gym, Bathroom, SocialHub, Park, Shop; Leisure folded into Observe).

![Refer to caption](https://arxiv.org/html/2605.23652v1/figures/ue5-x6.png)

Figure 15: Gameplay Debugger overlay on a single agent. Yellow text (right) is the live Blackboard / BT snapshot for the selected `APCSPAgentCharacter`: current `DesiredActionType`, `UrgencyScore`, `bAffordanceReserved`, and the active BT node. This view confirms that the persona branch is routing decisions through `BTTask_PCSPDecision` $\to$ `MoveToAffordance` $\to$ `PerformInteraction` and that the emergency decorator fires only when `UrgencyScore` $\geq 0.85$.

## Appendix C Inference Pipeline, Observability, and Implementation Cost

This appendix documents the engine\-side instrumentation and the integration footprint that supports the Layer\-3 findings in §[VII](https://arxiv.org/html/2605.23652#S7)\. The key claim is reproducibility: the same consistency\-loss ablation that defines the paper’s central Layer\-1 finding can be re\-run end\-to\-end inside the engine without re\-instrumenting, on top of an integration that is small enough to be audited in a single sitting\.

Observability. In addition to the offline JSONL traces, the live Gameplay Debugger (Fig. [15](https://arxiv.org/html/2605.23652#A2.F15)) exposes per-agent Blackboard and BT state at runtime. Every agent emits a per-decision JSONL trace (action, 20-d softmax logits, 8-d needs snapshot, urgency, reward, zone selection diagnostic, failure reason). The same `analyze_ue_session.py` pipeline that aggregates these traces also computes per-persona symmetric KL between paired PIE sessions, so the consistency-loss ablation that defines the paper’s central finding can be re-run inside the engine without re-instrumenting.

Implementation cost. The engine integration is deliberately lightweight: $\sim 2{,}000$ lines of C++ across `Source/cnzoi/PCSP/` (one world subsystem, five agent components, three BT tasks, an agent character), one $\sim 4$ MB ONNX file plus a $\sim 150$ KB persona-embedding JSON, and zero engine-side training. The Python pipeline remains the single source of truth for both the policy and the persona embeddings. The 64-agent stress test runs at $\geq 60$ FPS in PIE on a single consumer GPU/CPU. Inference is synchronous on the BT tick with a 0.5 s decision throttle and exponential backoff on repeated move failures, which caps ONNX calls at $\leq 128/\text{s}$ at this scale (64 agents, each deciding at most once per 0.5 s), comfortably within the NNERuntimeORT CPU path’s headroom without async batching.
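
The call-rate cap follows from simple arithmetic, sketched below. The helper name is hypothetical, and the 200 µs figure is a round number near the 183–202 µs per-call latencies reported in Appendix D; the calculation assumes every agent decides at most once per throttle interval and that calls run back to back on one thread.

```python
def onnx_call_budget(n_agents, throttle_s, latency_us):
    """Upper bound on policy calls per second and the CPU time they consume.

    Returns (calls_per_second, cpu_milliseconds_per_wall_second).
    """
    calls_per_s = n_agents / throttle_s
    cpu_ms_per_s = calls_per_s * latency_us / 1000.0
    return calls_per_s, cpu_ms_per_s


if __name__ == "__main__":
    calls, cpu_ms = onnx_call_budget(n_agents=64, throttle_s=0.5, latency_us=200.0)
    print(f"{calls:.0f} calls/s, {cpu_ms:.1f} ms of inference per second")
```

Under these assumptions 64 agents produce at most 128 calls per second, which at roughly 200 µs each amounts to about 25.6 ms of inference per wall-clock second, a small fraction of one CPU core.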

## Appendix D Realtime Scaling Sweep

This appendix reports the single-workstation scaling sweep that establishes the realtime envelope of the hybrid stack and identifies NavMesh `FindPath` saturation (not policy inference) as the hard ceiling. Table [XIV](https://arxiv.org/html/2605.23652#A4.T14) is the source data behind the “$n \leq 64$ recommended, $n = 96$ soft cap, $n \geq 128$ requires async batched pathfinding” guidance referenced in §[VII](https://arxiv.org/html/2605.23652#S7).

Protocol. 18 runs total (`research/results/ue_sessions/scaling_20260520/`). All runs share the same ONNX policy, the same 0.5 s decision throttle with urgency-driven preemption, and a 60 FPS wall budget. The sweep is fully automated by `tools/run_scaling_sweep.ps1`, which sets `-PCSP_AgentCount/SpawnSeed/RunDurationSeconds` on the command line (`FParse::Value` reads these before any `BeginPlay`); auto-quit fires from `FTSTicker::GetCoreTicker` inside `UPCSPPerfSamplerSubsystem`, and a +90 s PowerShell watchdog guarantees forward progress.

Observations. (i) *ONNX inference is not the bottleneck.* Mean per-call latency stays at 183–202 µs through $n = 64$; the apparent drop to 154 µs ($n = 96$) and 132 µs ($n = 128$) reflects CPU-scheduler timeslicing at saturation, not model speedup. (ii) *Frame time scales near-linearly at $\sim 0.27$ ms/agent*, with the p95 frame budget intact through $n = 96$ (15.9 ms) and only 0.4 ms over the 16.67 ms 60 FPS budget at $n = 128$. (iii) *NavMesh pathfinding is the hard ceiling.* BT-abort failure rate is $\leq 0.2\%$ for $n \leq 64$, rises to 4.7% at $n = 96$, and collapses to **44.9%** at $n = 128$ as concurrent `FindPath` requests exceed the Recast query queue’s async capacity (synchronous BT `MoveToAffordance`). The recommended real-time operating point is $\boldsymbol{n \leq 64}$; 96 is a soft cap; 128+ requires async batched pathfinding or a crowd-simulation fallback. Intent throughput is stable at 5.6–6.1 intents/agent/min for $n \leq 64$: per-agent decision rate is invariant under crowd scaling.

TABLE XIV: UE5 scaling sweep (Fig. [14](https://arxiv.org/html/2605.23652#A2.F14); 18 runs). $\{8, 16, 32, 64, 96, 128\}$ agents $\times$ 3 seeds $\times$ 630 s on `Map_PCSPDistrict_M` in standalone `-game` mode. All runs use the same frozen v3 ONNX policy, identical zone capacities, 0.5 s decision throttle, and 60 FPS wall budget. Inference latency stays well under the 250 µs per-agent budget through $n = 64$; the hard ceiling is NavMesh `FindPath` saturation at $n \geq 96$, not policy inference. Source: `latency_budget.tsv` from `research/scripts/analyze_scaling_sweep.py`.

## References

- \[1\] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba (2017) Hindsight experience replay. In Advances in Neural Information Processing Systems (NeurIPS).
- \[2\] N. Bard, J. N. Foerster, S. Chandar, N. Burch, M. Lanctot, H. F. Song, E. Parisotto, V. Dumoulin, S. Moitra, E. Hughes, et al. (2020) The Hanabi challenge: a new frontier for AI research. Artificial Intelligence 280.
- \[3\] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, et al. (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
- \[4\] L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch (2021) Decision transformer: reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems (NeurIPS).
- \[5\] M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2019) BabyAI: a platform to study the sample efficiency of grounded language learning. In International Conference on Learning Representations (ICLR).
- \[6\] K. Cobbe, C. Hesse, J. Hilton, and J. Schulman (2020) Leveraging procedural generation to benchmark reinforcement learning. In International Conference on Machine Learning (ICML).
- \[7\] M. Colledanchise and P. Ögren (2018) Behavior trees in robotics and AI: an introduction. CRC Press.
- \[8\] C. S. de Witt, T. Gupta, D. Makoviichuk, V. Makoviychuk, P. H. S. Torr, M. Sun, and S. Whiteson (2020) Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533.
- \[9\] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2019) Diversity is all you need: learning skills without a reward function. In International Conference on Learning Representations (ICLR).
- \[10\] J. N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson (2018) Counterfactual multi-agent policy gradients. In AAAI Conference on Artificial Intelligence.
- \[11\] L. R. Goldberg (1990) An alternative “description of personality”: the Big-Five factor structure. Journal of Personality and Social Psychology 59(6), pp. 1216–1229.
- \[12\] Y. Hong (2026) Phase 5 report: persona-conditioned shared policies on melting pot. Note: Project report, `research/meltingpot/PHASE5_REPORT.md`.
- \[13\] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR).
- \[14\] M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel (2022) CIC: contrastive intrinsic control for unsupervised skill discovery. arXiv preprint arXiv:2202.00161.
- \[15\]J\. Z\. Leibo, E\. A\. Dueñez\-Guzman, A\. Vezhnevets, J\. P\. Agapiou, P\. Sunehag, R\. Koster, J\. Matyas, C\. Beattie, I\. Mordatch, and T\. Graepel\(2021\)Scalable evaluation of multi\-agent reinforcement learning with melting pot\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§IV\-B](https://arxiv.org/html/2605.23652#S4.SS2.p1.2),[§VI](https://arxiv.org/html/2605.23652#S6.p1.1)\.
- \[16\]R\. Lowe, Y\. Wu, A\. Tamar, J\. Harb, P\. Abbeel, and I\. Mordatch\(2017\)Multi\-agent actor\-critic for mixed cooperative\-competitive environments\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§VI](https://arxiv.org/html/2605.23652#S6.p1.1)\.
- \[17\]R\. R\. McCrae and P\. T\. Costa Jr\(1992\)An introduction to the five\-factor model and its applications\.Journal of Personality60\(2\),pp\. 175–215\.Cited by:[§IV\-A](https://arxiv.org/html/2605.23652#S4.SS1.p2.4)\.
- \[18\]Nookipedia\(2026\)Villager\.Note:[https://nookipedia\.com/wiki/Villager](https://nookipedia.com/wiki/Villager)Accessed May 11, 2026Cited by:[§V\-E](https://arxiv.org/html/2605.23652#S5.SS5.p1.5)\.
- \[19\]J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein\(2023\)Generative agents: interactive simulacra of human behavior\.InACM Symposium on User Interface Software and Technology \(UIST\),Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px3.p1.1),[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px4.p1.1),[§VIII](https://arxiv.org/html/2605.23652#S8.p1.1)\.
- \[20\]D\. Pathak, P\. Agrawal, A\. A\. Efros, and T\. Darrell\(2017\)Curiosity\-driven exploration by self\-supervised prediction\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px4.p1.1)\.
- \[21\]E\. Perez, F\. Strub, H\. De Vries, V\. Dumoulin, and A\. Courville\(2018\)FiLM: visual reasoning with a general conditioning layer\.InAAAI Conference on Artificial Intelligence,Cited by:[§III\-B](https://arxiv.org/html/2605.23652#S3.SS2.p1.1)\.
- \[22\]S\. Reed, K\. Zolna, E\. Parisotto, S\. G\. Colmenarejo, A\. Novikov, G\. Barth\-Maron, M\. Gimenez, Y\. Sulsky, J\. Kay, J\. T\. Springenberg,et al\.\(2022\)A generalist agent\.arXiv preprint arXiv:2205\.06175\.Cited by:[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px4.p1.1)\.
- \[23\]N\. Reimers and I\. Gurevych\(2019\)Sentence\-BERT: sentence embeddings using Siamese BERT\-networks\.InConference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§III\-A](https://arxiv.org/html/2605.23652#S3.SS1.p1.2)\.
- \[24\]T\. Schaul, D\. Horgan, K\. Gregor, and D\. Silver\(2015\)Universal value function approximators\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px1.p1.1)\.
- \[25\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§III\-C](https://arxiv.org/html/2605.23652#S3.SS3.p1.6)\.
- \[26\]N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px3.p1.1)\.
- \[27\]D\. Silver, T\. Hubert, J\. Schrittwieser, I\. Antonoglou, M\. Lai, A\. Guez, M\. Lanctot, L\. Sifre, D\. Kumaran, T\. Graepel, T\. Lillicrap, K\. Simonyan, and D\. Hassabis\(2018\)A general reinforcement learning algorithm that masters chess, shogi, and Go through self\-play\.Science362\(6419\),pp\. 1140–1144\.Cited by:[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px4.p1.1)\.
- \[28\]J\. K\. Terry, B\. Black, N\. Grammel, M\. Jayakumar, A\. Hari, R\. Sullivan, L\. S\. Santos, C\. Dieffendahl, C\. Horsch, R\. Perez\-Vicente, N\. L\. Williams, Y\. Lokesh, and P\. Ravi\(2021\)PettingZoo: gym for multi\-agent reinforcement learning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§IV\-A](https://arxiv.org/html/2605.23652#S4.SS1.p1.1)\.
- \[29\]The Sims Wiki\(2026\)Trait \(The Sims 3\)\.Note:[https://sims\.fandom\.com/wiki/Trait\_\(The\_Sims\_3\)](https://sims.fandom.com/wiki/Trait_(The_Sims_3))Accessed May 11, 2026Cited by:[§V\-E](https://arxiv.org/html/2605.23652#S5.SS5.p1.5)\.
- \[30\]A\. van den Oord, Y\. Li, and O\. Vinyals\(2018\)Representation learning with contrastive predictive coding\.arXiv preprint arXiv:1807\.03748\.Cited by:[§III\-C](https://arxiv.org/html/2605.23652#S3.SS3.p2.2)\.
- \[31\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin\(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§III\-A](https://arxiv.org/html/2605.23652#S3.SS1.p1.2)\.
- \[32\]O\. Vinyals, I\. Babuschkin, W\. M\. Czarnecki, M\. Mathieu, A\. Dudzik, J\. Chung, D\. H\. Choi, R\. Powell, T\. Ewalds, P\. Georgiev,et al\.\(2019\)Grandmaster level in StarCraft II using multi\-agent reinforcement learning\.Nature575\(7782\),pp\. 350–354\.Cited by:[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px4.p1.1)\.
- \[33\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\(2023\)Voyager: an open\-ended embodied agent with large language models\.arXiv preprint arXiv:2305\.16291\.Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px3.p1.1),[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px4.p1.1)\.
- \[34\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px3.p1.1)\.
- \[35\]G\. N\. Yannakakis and J\. Togelius\(2018\)Artificial intelligence and games\.Springer\.Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px1.p1.1)\.
- \[36\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§I\-A](https://arxiv.org/html/2605.23652#S1.SS1.SSS0.Px3.p1.1),[§II](https://arxiv.org/html/2605.23652#S2.SS0.SSS0.Px4.p1.1)\.
- \[37\]C\. Yu, A\. Velu, E\. Vinitsky, J\. Gao, Y\. Wang, A\. Bayen, and Y\. Wu\(2022\)The surprising effectiveness of PPO in cooperative multi\-agent games\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§III\-C](https://arxiv.org/html/2605.23652#S3.SS3.p1.6)\.
- \[38\]S\. Zhang, E\. Dinan, J\. Urbanek, A\. Szlam, D\. Kiela, and J\. Weston\(2018\)Personalizing dialogue agents: i have a dog, do you have pets too?\.InAnnual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§VIII](https://arxiv.org/html/2605.23652#S8.p1.1)\.
- \[39\]Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou\(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.Note:arXiv preprint arXiv:2506\.05176Cited by:[§III\-A](https://arxiv.org/html/2605.23652#S3.SS1.p1.2)\.