OdysSim: Building Foundation Models for Human Behavior Simulation

arXiv cs.CL 06/15/26, 04:00 AM Papers
human-simulation foundation-model behavior-modeling llm multi-task open-source training-recipe
Summary
OdysSim presents a systematic investigation into behavioral foundation models for simulating human behavior, introducing the Soul taxonomy, a corpus of 21.4M interactions, and a training recipe that achieves state-of-the-art on 8 of 23 benchmark tasks while producing more human-like outputs.
arXiv:2606.14199v1 Announce Type: new Abstract: Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation. Yet helpfulness-driven post-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap. We present OdysSim, the largest open systematic investigation of behavioral foundation models, i.e., models trained to simulate human behavior at scale. We propose SOUL, a taxonomy of five capability axes (CONV, SS, COG, ROLE, EVAL) that unifies 62 datasets and 23 benchmark tasks under one framework. Specifically, we curate the OdysSim corpus (21.4M interactions, 10B tokens, retrofitted with back-generated social contexts), construct the SOUL-Index benchmark, and develop an end-to-end training recipe combining midtraining, task-specific RL, and expert distillation. The resulting open 8B OSim model ranks first or tied-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks. Its outputs are also more human-like in length, formatting, and word choice, and it transfers zero-shot to out-of-distribution user simulation on $\tau$-bench, nearly matching real users on reaction alignment (93.2 vs. 93.5). We further show that LLM-as-judge RL induces reward-hacking patterns, and that our detectors can mitigate them during post-training. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm. We release all artifacts to support future research.
Original Article
View Cached Full Text
Cached at: 06/15/26, 08:58 AM
# 𝒪dysSim Building Foundation Models for Human Behavior Simulation
Source: [https://arxiv.org/html/2606.14199](https://arxiv.org/html/2606.14199)
Xuhui Zhou1Weiwei Sun111footnotemark:1Weihua Du1Jiarui Liu1Haojia Sun1 Qianou Ma1Tongshuang Wu1Yiming Yang1Maarten Sap1 1Carnegie Mellon University, Language Technologies Institute \{xuhuiz, weiweis\}@andrew\.cmu\.edu [Code](https://github.com/sunnweiwei/OdysSim)![[Uncaptioned image]](https://arxiv.org/html/2606.14199v1/logo/huggingface.png)[Model](https://huggingface.co/collections/cmu-lti/odyssim)![[Uncaptioned image]](https://arxiv.org/html/2606.14199v1/logo/huggingface.png)[Midtraining Data](https://huggingface.co/datasets/cmu-lti/osim-mid-training)![[Uncaptioned image]](https://arxiv.org/html/2606.14199v1/logo/huggingface.png)[Post\-training Data](https://huggingface.co/datasets/cmu-lti/osim-post-training)

###### Abstract

Large language models are increasingly deployed as human simulators for interactive evaluation and social simulation\. Yet helpfulness\-driven post\-training pulls them toward a homogeneous, overly agreeable assistant register, creating a behavioral Sim2Real gap\. We present𝒪\\mathcal\{O\}dysSim, the largest open systematic investigation ofbehavioral foundation models, i\.e\., models trained to simulate human behavior at scale\. We proposeSoul, a taxonomy of five capability*axes*\(CONV,SS,COG,ROLE,EVAL\) that unifies 62 datasets and 23 benchmark tasks under one framework\. Specifically, we curate the𝒪\\mathcal\{O\}dysSimcorpus \(21\.4M interactions, 10B tokens, retrofitted with back\-generated social contexts\), construct theSoul\-Index benchmark, and develop an end\-to\-end training recipe combining midtraining, task\-specific RL, and expert distillation\. The resulting open 8B𝒪\\mathcal\{O\}simmodel ranks first or tied\-first on 8 of 23 tasks, outperforming any individual frontier model by this count, with the strongest gains on conversational and social tasks\. Its outputs are also more human\-like in length, formatting, and word choice, and it transfers zero\-shot to out\-of\-distribution user simulation onτ\\tau\-bench, nearly matching real users on reaction alignment \(93\.2 vs\. 93\.5\)\. We further show that LLM\-as\-judge RL induces reward\-hacking patterns, and that our detectors can mitigate them during post\-training\. Together, our findings suggest that behavioral foundation models require rethinking the LLM training paradigm\. We release all artifacts to support future research\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x1.png)Figure 1:Benchmark results on human simulation tasks\.## 1Introduction

Simulating human behavior is becoming a critical capability for AI systems\. Realistic behavioral models are needed for user simulation in agent evaluation\(Yao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib59)\), patient simulation in clinical training\(Kyung et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib24)\), learner simulation in educational technology\(Ross & Andreas,[2025a](https://arxiv.org/html/2606.14199#bib.bib40)\), and persona simulation in social science\(Park et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib37); Argyle et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib1)\)\. Yet current large language models \(LLMs\) fall short: they are systematically biased, stylistically uniform, and excessively agreeable, exhibiting what has been termed the*Sim2Real gap*\(Zhou et al\.,[2026a](https://arxiv.org/html/2606.14199#bib.bib67)\), and prompting alone does not suffice\(especially for more “undesirable” behaviors that humans naturally display; Li et al\.,[2025b](https://arxiv.org/html/2606.14199#bib.bib29)\)\. The root cause lies in the LLM training pipeline: \(i\) standard pretraining ingests vast amounts of internet text, including but not necessarily real human behavior, \(ii\) helpfulness\-driven post\-training\(e\.g\., RLHF; Ouyang et al\.,[2022](https://arxiv.org/html/2606.14199#bib.bib36)\)actively pulls models toward an assistant register, and \(iii\) evaluation protocols typically reward task success and instruction following, while leaving behavioral realism, diversity, and social fidelity under\-specified\.

Closing the gap requires rethinking the pipeline end\-to\-end:*what we measure*,*what data*the model learns from, and*how*it is trained\. We present𝒪\\mathcal\{O\}dysSim, the largest open effort to build a behavioral foundation model111We use*behavioral foundation model*in the natural\-language sense throughout: a model trained at scale to simulate human behavior in linguistic interaction\. This is distinct from the embodied\-control sense used in robotics, where “behavior foundation model” refers to whole\-body motor control policies for humanoid robots\(Tirinzoni et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib50); Zeng et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib61)\)\.to date, comprising a 23\-task benchmark, a 21\.4M\-interaction \(10B\-token\) midtraining corpus from 62 public sources, and an end\-to\-end RL recipe\. Additionally, as shown in Figure[2](https://arxiv.org/html/2606.14199#S1.F2), we designSoul\(Simulation Of hUman\-Like behavior\), a framework that defines five capability*Axes*\(CONV,SS,COG,ROLE,EVAL\) to jointly index the𝒪\\mathcal\{O\}dysSimcorpus and theSoul\-Index evaluation suite \([Section3](https://arxiv.org/html/2606.14199#S3.SS0.SSS0.Px3)\)\. Behavior simulation is inherently grounded: to simulate a human response, a model must condition not only on an input utterance or situation, but also on who the speaker is, what role they occupy, and what social intent shapes the interaction\. We therefore formalize behavioral simulation as generating a response given both an interaction context and a social grounding specification, such as a character profile, role, or goal\. This creates a practical data challenge: many raw sources used for midtraining, such as WildChat entries\(Zhao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib62)\)and ConvoKit threads\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\), contain rich dialogue but lack explicit speaker grounding, making social dynamics hard to infer from text alone\. We address this by retrofitting each dialogue with back\-generated social contexts, including a character profile and interaction goal\. We further show supporting training with social grounding context is important for learning to simulate human behavior\.

At the core of our investigation, we midtrain Qwen3 base models on the𝒪\\mathcal\{O\}dysSimcorpus to create𝒪\\mathcal\{O\}sim\-Mid\. With the𝒪\\mathcal\{O\}sim\-Mid, we further perform task\-specific reinforcement learning and create an expert model for eachSoul\-Index task: GRPO when the task has a verifiable reward, and RL with verbal feedback when the task is judged by an LLM that returns both a scalar reward and a textual critique\(Sun et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib49); Song et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib47)\)\. Finally, we use expert distillation to merge the resulting task\-specific experts into a single deployable model\. The two stages are complementary: midtraining provides a behaviorally\-aware initialization \(*what*human behavior looks like at scale\); task\-specific RL adds precision under the right reward signal \(*how*to behave on each task\)\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x2.png)Figure 2:Overview of the𝒪\\mathcal\{O\}dysSimrecipe\. We iteratively collect and curate the𝒪\\mathcal\{O\}dysSimcorpus, build theSoulframework, and construct theSoul\-Index as an evaluation suite\. We first midtrain a base Qwen3 checkpoint into𝒪\\mathcal\{O\}sim\-Mid\. Then we follow with task\-specific RL, training one expert perSoul\-Index task\. Expert distillation then merges these experts into the final𝒪\\mathcal\{O\}simmodel\.Putting the recipe together: midtraining, then task\-specific RL, then expert distillation, yields𝒪\\mathcal\{O\}sim\-8B, which leadsSoul\-Index and is best or tied\-best on more tasks \(8 of 23\) than any individual frontier model\. Its improvements concentrate on the interactive, socially grounded tasks that general\-purpose post\-training underscores, producing more “human\-like” behaviors such as shorter sentences and fewer assistant\-like phrases\. The gains also transfer beyond chat settings: onτ\\tau\-USI, an out\-of\-distribution user\-simulation evaluation for tool\-use agents,𝒪\\mathcal\{O\}sim\-8B achieves the strongest reaction alignment among evaluated simulators, nearly matching real users \(React93\.293\.2vs\.93\.593\.5\), outperforming any frontier models\. Our ablations show that the two stages contribute in qualitatively different ways: midtraining alone shifts outputs toward the human register in length, formatting, and word choice, lifting Qwen3\-8B\-Base from26\.926\.9to41\.141\.1onSoul\-Index; task\-specific RL then adds the largest gains on role\-playing \(ROLE\) and conversational \(CONV\) tasks\. Together, these results close the loop on our central claim: building behavioral foundation models requires aligning what we measure \(Soul\-Index\), what data the model learns from \(𝒪\\mathcal\{O\}dysSimcorpus\), and what the training objective rewards around behavioral realism rather than task success alone\.

#### Contributions\.

\(1\) TheSoulFramework\.A single set of five behavioral\-capability*Axes*that jointly guides the midtraining, post\-training, and evaluation, together with theSoul\-Index— to our knowledge the most comprehensive open evaluation for human\-behavior simulation\.\(2\) The𝒪\\mathcal\{O\}dysSimCorpus\.A behavioral midtraining corpus of 21\.4M interactions \(≈\\approx10B tokens\) from 62 public sources, unified into a common conversational format and equipped with a*retrofit pipeline*that back\-generates per\-conversation social groundings \(e\.g\., character profile, interaction goal\)\.\(3\) End\-to\-End Recipe\.Midtraining on𝒪\\mathcal\{O\}dysSim, task\-specific RL on eachSoul\-Index task \(GRPO and RLVF\), and expert distillation into a single final model that improves both in\-benchmark behavior simulation and zero\-shot user simulation for tool\-using agents\.

## 2Related Work

#### Evaluating Behavioral Simulation\.

Existing benchmarks for behavioral simulation focus on specific aspect of human behavior, each targeting narrow capability or task format: theory of mind\(Kim et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib21); Le et al\.,[2019](https://arxiv.org/html/2606.14199#bib.bib26)\), social interaction\(Zhou et al\.,[2024b](https://arxiv.org/html/2606.14199#bib.bib66)\), role\-play with persona\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53); Kirk et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib22); Li et al\.,[2025a](https://arxiv.org/html/2606.14199#bib.bib28)\), social and cognitive experiments\(Kolluri et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib23); Binz et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib4)\)or more recently, user behavior simulation interacting with AI agents\(Dou et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib11); Zhou et al\.,[2026a](https://arxiv.org/html/2606.14199#bib.bib67)\)\. This fragmentation makes it hard to track the overall progress in one aspect \(say, theory\-of\-mind accuracy\) implies anything about a different capability \(say, role\-play fidelity\), or to compare modeling approaches that target different capabilities\.

#### Training Behavioral Foundation Models\.

Prior efforts to train LLMs for human\-behavior simulation differ in methodology and scale, but each remains bounded to narrow behavioral domain\. Many adapt a general\-purpose post\-trained LLM via SFT or RL:Sotopia\-π\\pi\(Wang et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib52)\)clones expert social\-interaction trajectories,Sotopia\-RL\(Yu et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib60)\)adds utterance\-level multi\-dimensional rewards,Omar\(Jiang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib19)\)trains via multi\-agent self\-play, andUserLM\(Naous et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib34)\)finetune for the user side ofWildChat\(Zhao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib62)\)dialogues\. Others build new corpora but stay within a single domain:Centaur\(Binz et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib4)\)finetunes onPsych\-101\(10M choices from 160 cognitive\-psychology experiments\),Be\.FM\(Xie et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib57)\)targets four behavioral\-science capabilities,Socrates\(Kolluri et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib23)\)finetunes onSocSci210\(2\.9M social\-science responses\)\. WhileSun et al\. \([2026](https://arxiv.org/html/2606.14199#bib.bib49)\)andWu et al\. \([2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)investigate diverse domains of tasks and capabilities, both initialize from instruction\-tuned models already optimized to act as helpful assistants, which risks suppressing the behavioral diversity needed to faithfully simulate human behavior\.

#### Midtraining and RL with LLM\-Judge Feedback\.

Midtraining adapts pretrained models to a target distribution before post\-training\(Gururangan et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib14); Liu et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib31); Mo et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib33)\), but prior work mainly studies domains such as code\(Rozière et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib42)\)and math, where the shift is largely lexical, syntactic, or task\-skill driven\. Our setting differs: human\-behavior simulation requires socially grounded shifts in persona, intent, register, and interaction style, which have not been systematically studied under midtraining\. For open\-ended behavioral tasks, prior work uses LLM judges and sometimes textual feedback to optimize generations\(Zheng et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib63); Verga et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib51); Sun et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib49); Song et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib47)\)\. We build on this line, but focus on behavioral fidelity rather than helpfulness or task success\.

## 3TheSoulFramework

![Refer to caption](https://arxiv.org/html/2606.14199v1/x3.png)Figure 3:The fiveSoulAxes\. Each strip lists the𝒪\\mathcal\{O\}dysSimcorpus datasets contributing to that Axis \(left, 62 sources, 21\.4M interactions,≈\\approx10B tokens\) and theSoul\-Index evaluation tasks for that Axis \(right, 23 tasks\)\. Sources that appear on both sides are listed once on the eval side; “…\\ldots” marks truncated corpus pills\.CONV: discourse and interaction dynamics;SS: social skills;COG: cognitive / mental\-state reasoning;ROLE: persona, roleplay, and pedagogy;EVAL: judgment and preference\.We first introduceSoul\(Simulation Of human\-Like behavior\), a framework that identifies the capability axes used to \(i\) categorize the𝒪\\mathcal\{O\}dysSimmidtraining corpus and \(ii\) aggregate the scores of behavioral fidelity on theSoul\-Index evaluation suite\.[Figure3](https://arxiv.org/html/2606.14199#S3.F3)shows the fiveSoulAxes and the datasets and tasks that contribute to each Axis\. Please refer to Appendix[E](https://arxiv.org/html/2606.14199#A5)for more details\.

#### SoulAxes

We define theSoulaxes through a two\-stage taxonomy\-building process\. We first greedily collect all the public datasets and benchmarks that are relevant to the capability of simulating human behavior such as users interacting with AI agents, reddit conversations, movie dialogues, online shopping, psychological experiments, and etc\.*\(i\) Bottom\-up:*we audited each candidate dataset and clustered them by the dominant social or cognitive phenomenon their interactions capture \(e\.g\., persuasion, emotion support, false\-belief reasoning, role\-play with persona\)\.*\(ii\) Top\-down:*we anchored these emergent clusters against cognitive and social\-psychology literature\(Hymes,[1972](https://arxiv.org/html/2606.14199#bib.bib18); Cialdini,[2007](https://arxiv.org/html/2606.14199#bib.bib9); Baron\-Cohen et al\.,[1985](https://arxiv.org/html/2606.14199#bib.bib3)\), and formalized them into the five Axes in[Figure3](https://arxiv.org/html/2606.14199#S3.F3)\. The axes are intended as a practical organizing scheme for this work, not an exhaustive taxonomy of human behavior\. For corpus construction and benchmark reporting, each dataset or task is assigned to the axis that best reflects its dominant capability, while recognizing that many sources involve multiple behaviors\.

#### The𝒪\\mathcal\{O\}dysSimCorpus

We build the𝒪\\mathcal\{O\}dysSimcorpus by unifying 62 datasets \(≈\\approx10B training tokens, 21\.4M interactions\) into a common conversational format, organized along the fiveSoulAxes \([Figure3](https://arxiv.org/html/2606.14199#S3.F3), left column\)\. For sources that lack persona or scenario \(e\.g\., open\-domain AI chat logs like WildChat,Zhao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib62); ConvoKit corpora,Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\), we synthesize a per\-record social context, a textual description of who is speaking, their role, goal, and conversational style, generated from the first 60% of each conversation’s turns so the persona cannot foreshadow the trajectory\. Sources that natively carry persona or scenario information \(e\.g\., CoSER,Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53); Sotopia,Zhou et al\.,[2024b](https://arxiv.org/html/2606.14199#bib.bib66); Humanual,Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\) retain their original system prompt without modification\. After this step, the 62 datasets together comprise 19\.7M unique social contexts containing*1\.09M distinct personas*\(the union of matched occupation, trait, demographic, and personality terms\), covering 414 occupations, 358 demographic markers, and 86 personality types \(See[SectionE\.4](https://arxiv.org/html/2606.14199#A5.SS4)for more details\)\.*Train and test sets are persona\-disjoint by construction:*distinct test personas are unseen during training\. The full task\-to\-training\-source mapping, per\-source row counts, the split logic are in[SectionE\.6](https://arxiv.org/html/2606.14199#A5.SS6)\.

#### TheSoul\-Index

Evaluating human behavior simulation requires both*breadth*\(covering diverse behavioral facets\) and*depth*\(grounding in real human behavior rather than proxy metrics alone\)\. We introduceSoul\-Index that operationalizes both: 23 benchmarks organized bySoulAxis \([Figure3](https://arxiv.org/html/2606.14199#S3.F3), right column\), with formats spanning discriminative \(MCQ, binary, ranking\) and generative \(single\- and multi\-turn dialogue\)\. All scores are normalized to\[0,1\]\[0,1\]and aggregated by arithmetic mean across tasks\.[Table13](https://arxiv.org/html/2606.14199#A6.T13)\(appendix\) lists every task with its parent axis, format, and evaluation metric; per\-task descriptions are in[AppendixF](https://arxiv.org/html/2606.14199#A6)\.

During post\-training, we curate axis\-aligned training data for everySoul\-Index task: when a benchmark comes with its own training split \(HUMANUAL, CoSER, the ToM benchmarks, Sotopia, AlignX, HumanLLM, SocSci210\), we use that split directly as targeted axis\-aligned data and enforce row\-level disjointness against theSoul\-Index test rows; for tasks without a native training set \(e\.g\., Sotopia\-Hard\), we substitute closely\-related data from the same benchmark family \(e\.g\., SOTOPIA\-π\\pi,Wang et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib52)scenarios\) as a proxy\. Per\-source statistics, the training dataset construction process, and more dataset details are in[SectionE\.6](https://arxiv.org/html/2606.14199#A5.SS6)\.

Table 1:Per\-skill geometric\-mean PPL \(PL↓\\downarrow\) and arithmetic\-mean BLEU \(BL↑\\uparrow\) on the evaluation split of role\-swapped human turns\. Rows:\(A\)no midtraining,\(B\)other baselines,\(C\)ours\.ModelCONVSSCOGROLEEVALOverallPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrow\(A\) No\-midtraining baselinesQwen3\-0\.6B20\.750\.7025\.250\.6111\.172\.3711\.396\.3754\.983\.9617\.432\.80Qwen3\-4B14\.081\.4717\.521\.918\.075\.608\.5610\.7824\.146\.2712\.005\.18Qwen3\-8B14\.232\.8717\.341\.459\.723\.638\.1114\.9434\.174\.1712\.546\.17\(B\) Other baselinesUserLM\-8B11\.625\.3313\.295\.874\.026\.976\.614\.3712\.901\.118\.385\.12CoSER\-8B15\.702\.8614\.171\.973\.207\.396\.9614\.908\.8518\.128\.778\.05Llama\-3\.1\-8B14\.344\.1120\.091\.934\.163\.587\.8212\.4113\.477\.8110\.046\.23\(C\) Ours𝒪\\mathcal\{O\}sim\-0\.6B\-Mid11\.995\.5114\.182\.012\.6511\.756\.4615\.265\.6843\.027\.3511\.81𝒪\\mathcal\{O\}sim\-4B\-Mid8\.238\.019\.6710\.062\.0944\.624\.6521\.534\.2646\.175\.2826\.08𝒪\\mathcal\{O\}sim\-8B\-Mid7\.628\.489\.0012\.442\.0144\.734\.3622\.494\.0345\.474\.9526\.72

## 4Midtraining: Setup and Results

Our goal is to model the diversity of real human behavior rather than the helpful, homogeneous register of an assistant\. Midtraining is the first stage of the𝒪\\mathcal\{O\}dysSimrecipe: it adapts a pretrained base model on the𝒪\\mathcal\{O\}dysSimcorpus to shift its broad, mode\-covering language prior toward a behaviorally aware “human\-side” distribution, producing𝒪\\mathcal\{O\}sim\-Mid for later refinement\. We start from a base checkpoint because pretraining already provides broad linguistic and behavioral coverage, while midtraining can specialize this prior without relearning general competence from scratch\. We also avoid instruction\-tuned checkpoints at this stage, since helpfulness\-driven post\-training encourages verbose, agreeable, and homogeneous assistant behavior, making such models poor simulators of diverse human behavior\(Jiang et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib20)\)\. Following work that frames midtraining as a bridge between pretraining and post\-training\(Mo et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib33)\), we evaluate this stage with perplexity \(PPL\) and short\-form generation overlap \(BLEU\) on the𝒪\\mathcal\{O\}dysSimtest split\.

#### Setup\.

We midtrain Qwen3 base checkpoints at three scales \(0\.6B, 4B, 8B;Qwen Team,[2025](https://arxiv.org/html/2606.14199#bib.bib38)\) on the𝒪\\mathcal\{O\}dysSimcorpus, and compare them withUserLM\-8B\(Naous et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib34)\),CoSER\-8B\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\), and Llama\-3\.1\-8B\. To study the effect of social context, we train a second Qwen3\-0\.6B\-Base variant with all system messages removed from the training examples\. As a data curation control, we ask whether the gains come from𝒪\\mathcal\{O\}dysSim’s behavioral curation or from generic chat SFT data alone: we train Qwen3\-4B\-Base on Step\-3\.5\-Flash\-SFT\(StepFun AI,[2025](https://arxiv.org/html/2606.14199#bib.bib48)\)only \(\+ Step\) and also sweep𝒪\\mathcal\{O\}dysSim:Step token mixtures\. The sweep shows that a 10% Step mixture \(90:1090\{:\}10𝒪\\mathcal\{O\}dysSim:Step\) improves generic\-instruction loss at little behavioral cost, while larger Step fractions increasingly trade away behavioral fit \(see App\.[H\.1](https://arxiv.org/html/2606.14199#A8.SS1)for the full analysis\)\. All midtraining runs use AdamW with peak learning rate1×10−51\\\!\\times\\\!10^\{\-5\},1616K\-input /88K\-response context, mini\-batch size1,0241\{,\}024entries per step, and88H100\-80GB GPUs on one node with FSDP\-2 and mixed\-precisionbfloat16\. The default endpoint is4,5004\{,\}500optimizer steps\. Full hyperparameters, per\-dataset upsampling weights, and dynamic batching details are in Appendix[G](https://arxiv.org/html/2606.14199#A7)\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x4.png)Figure 4:Capability ablation on Qwen3\-0\.6B\-Base\. Rows denote training data; columns denote evaluation Axis\. Cells report geomean PPL\.
![Refer to caption](https://arxiv.org/html/2606.14199v1/x5.png)Figure 5:System\-prompt ablation on Qwen3\-0\.6B\-Base\. Per\-Axis geomean PPL with vs\. without system prompts during training\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x6.png)Figure 6:Behavioral probing of different models\.*\(A\):*median response length and binary\-presence rates for nine open\-coded surface features overN=7,451N\{=\}7\{,\}451prompts sampled from𝒪\\mathcal\{O\}dysSimtest split\. OdysSim is closer to the human reference on all features\.*\(B\):*per\-text HumT score\(Cheng et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib8)\)\. Higher score means more human\-like\.*\(C\):*concrete examples from our behavioral probing prompts\. For each prompt, we show three𝒪\\mathcal\{O\}simgenerations\.
#### Per\-axis behavioral fit\.

[Table1](https://arxiv.org/html/2606.14199#S3.T1)reports per\-axis PPL and BLEU on the held\-out evaluation split, with PPL aggregated as the geometric mean of per\-dataset perplexities so each dataset is weighted equally\.𝒪\\mathcal\{O\}sim\-4B\-Mid obtains overall PPL5\.285\.28and BLEU26\.0826\.08, and the 8B checkpoint further lowers PPL to4\.954\.95and raises BLEU to26\.7226\.72\. Three observations follow\.*\(i\) Midtraining yields a substantial behavioral\-fit gain\.*Raw Qwen3 checkpoints remain weak human simulators \(Qwen3\-4B:12\.0012\.00PPL /5\.185\.18BLEU\); midtraining on the𝒪\\mathcal\{O\}dysSimcorpus significantly reduces PPL \(4B:5\.285\.28; 8B:4\.954\.95\) and improves BLEU by roughly5×5\\timesat both scales \(4B:26\.0826\.08; 8B:26\.7226\.72\)\.*\(ii\) Behavioral curation outperforms targeted simulation baselines\.*This comparison includesUserLM\-8B, which is trained to model the user side of WildChat dialogues, andCoSER\-8B, which targets literary\-character role\-play and persona simulation\. Among these targeted 8B baselines and the general Llama\-3\.1\-8B reference,𝒪\\mathcal\{O\}sim\-8B\-Mid has both the lowest PPL \(4\.954\.95vs\. UserLM\-8B8\.388\.38, CoSER\-8B8\.778\.77, Llama\-3\.1\-8B10\.0410\.04\) and the highest BLEU score\.*\(iii\) Parameter scale compounds with the right data\.*Both raw Qwen3 and𝒪\\mathcal\{O\}dysSimimprove as parameter count increases: midtraining drives PPL down by6\.726\.72at 4B \(Qwen3\-4B12\.0012\.00vs\.𝒪\\mathcal\{O\}sim\-4B\-Mid5\.285\.28\) and7\.597\.59at 8B \(Qwen3\-8B12\.5412\.54vs\.𝒪\\mathcal\{O\}sim\-8B\-Mid4\.954\.95\)\. The 4B→\\to8B parameter step further improves PPL on every axis\.222Starting from instruction\-tuned Qwen3 checkpoints produces similar PPL/BLEU after𝒪\\mathcal\{O\}dysSimmidtraining as starting from the corresponding base checkpoints\.Together, these results suggest that𝒪\\mathcal\{O\}dysSimmidtraining turns a broadly pretrained language model into a stronger behavioral prior across axes, with parameter scale improving performance\.

#### Are theSoulAxes compositional?

To see how the fiveSoulAxes \([Figure3](https://arxiv.org/html/2606.14199#S3.F3)\) influence each other, we midtrain Qwen3\-0\.6B\-Base on each axis separately, plus an*Overall*reference trained on the full 63\-dataset mix\.[Figure5](https://arxiv.org/html/2606.14199#S4.F5)\(left\) shows three patterns: every specialist is best on its own axis \(strong diagonal\), off\-diagonal cells are typically1\.5−2\.7×1\.5\\\!\-\\\!2\.7\\\!\\timeshigher on PPL compared to the diagonal value, and the*Overall*model is within a small margin of every column\-minimum specialist onCONV/SS/COG/ROLE/EVAL\. These results indicate that the axes capture complementary behavioral skills: training on one axis does not transfer fully to the others, while the multi\-axis mixture is important for a general behavioral foundation model\.

#### System prompts as social context\.

To test the role of social grounding in𝒪\\mathcal\{O\}dysSim, we midtrain the 0\.6B model with system messages stripped, then evaluate with system prompts present \(this is the setting a deployed simulator faces, where a persona or scenario is supplied through the system prompt\)\.[Figure5](https://arxiv.org/html/2606.14199#S4.F5)\(right\) shows that removing system prompts during midtraining hurts overall PPL by13%13\\%\(7\.35→8\.307\.35\\to 8\.30\), mainly on axes that social grounding is most important \(ROLE\+25%\+25\\%, SS\+16%\+16\\%\), with little effect on cognition \(COG\+1%\+1\\%\)\. This shows that training with social grounding is important for simulating human behavior across diverse instructions and scenarios\.

#### What changes after midtraining?

Beyond per\-axis PPL, we probe output changes from three angles \([Figure6](https://arxiv.org/html/2606.14199#S4.F6)\): lexical and structural features on𝒪\\mathcal\{O\}dysSimtest\-set generations \(panel A\), HumT human\-likeness scores\(Cheng et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib8)\)on held\-out prompts \(panel B\), and 30 qualitative probing prompts following Chen et al\.\(Chen et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib7)\), with three independent𝒪\\mathcal\{O\}simsamples per prompt to measure intra\-model variation \(panel C\)\. All three show the same pattern\. The generic\-instruction baseline produces responses roughly twice as long as human references, with frequent Markdown headers, bullet lists, em\-dashes, and assistant\-style openers \(e\.g\.,*“I’d be happy to,”**“Of course\!”**“As an AI”*\)\. After midtraining,𝒪\\mathcal\{O\}simshifts toward a more “human” register: length matches human references, structural markup falls within one percentage point of the human rate, and assistant openers more than halve\. HumT also places𝒪\\mathcal\{O\}simwell above both the generic\-instruction baseline and raw pretrained checkpoint in human\-likeness\. Qualitatively,𝒪\\mathcal\{O\}simsamples vary in framing, vocabulary, and emotional register while avoiding the baseline’s hedging\-and\-listing assistant style\.

## 5Post\-training: Setup and Results

A human behavior simulator should be useful across diverse downstream settings, from user modeling and social interaction to role\-play, theory\-of\-mind, and human\-like evaluation\. Post\-training targets this diversity directly: each of the 23Soul\-Index benchmarks supplies a task\-specific behavioral objective, and RL optimizes the model against those interactive rewards rather than corpus likelihood alone\. This section completes the𝒪\\mathcal\{O\}dysSimrecipe by training task\-specific RL experts in theSoulenvironments, distilling their highest\-scoring rollouts into a single model, and evaluating the resulting𝒪\\mathcal\{O\}sim8B on the fullSoul\-Index\.

### 5\.1Post\-training Recipe

#### High\-level pipeline\.

Our post\-training pipeline has two stages\. First, we train one RL expert perSoul\-Index task, so each expert can specialize to that task’s reward signal and interaction protocol\. Tasks with verifiable rewards use GRPO directly; tasks judged by an LLM use our verbal\-feedback variant, described below\. Second, we merge these specialists through expert distillation: each expert generates candidate trajectories, the task reward or judge selects the best responses, and a single model is supervised\-finetuned on the pooled selected data\.

#### Training details\.

Unless otherwise stated, each RL expert starts from Qwen3\-8B\-Instruct\(Qwen Team,[2025](https://arxiv.org/html/2606.14199#bib.bib38)\)and uses GRPO with 8 rollouts per prompt, sampling temperature 1, no KL loss, asymmetric clip ratios of 0\.2 \(low\) and 0\.28 \(high\), peak learning rate 5e\-6, and a batch of 64 prompts per step \(PPO mini\-batches of 16\)\. We train for 200 steps by default, extending to 500 steps for slower\-converging tasks \(e\.g\., AlignX, SocSci210, and the Humanual family; cf\.[Figure7](https://arxiv.org/html/2606.14199#S5.F7)\), on 8 H100\-80G GPUs with FSDP\-2 and mixed\-precisionbfloat16\. The RL stage uses LoRA with rank 32, implemented with Verl\. For distillation, we use rejection sampling to generate expert trajectories across training tasks, yielding 58,702 distillation examples after filtering \([AppendixD](https://arxiv.org/html/2606.14199#A4)\), then supervise\-finetune𝒪\\mathcal\{O\}sim\-8B\-Mid on the combined data with learning rate 1e\-5 and batch size 256 for 500 steps; that is, RL\-trained experts generate the trajectories, and the midtrained checkpoint is the distillation target\. See[AppendixD](https://arxiv.org/html/2606.14199#A4)for post\-training dataset composition and[AppendixG](https://arxiv.org/html/2606.14199#A7)for full post\-training hyperparameters\.

#### RL with Verbal Feedback\.

For tasks with verifiable rewards, we directly apply GRPO using the scalar task reward\. For LLM\-as\-judge tasks, where the judge can also provide textual critiques and improvement suggestions, we use verbal\-feedback RL and treat this feedback as training\-time information for the teacher model\(Sun et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib49)\)\. For each promptxx, we sampleGGstudent rollouts,

yi,0∼πθ\(⋅∣x\),i=1,…,G,y\_\{i,0\}\\sim\\pi\_\{\\theta\}\(\\cdot\\mid x\),\\qquad i=1,\\ldots,G,and obtain reward\-feedback pairs from the task judge,\(ri,0,hi\)=𝒥\(x,yi,0\)\.\(r\_\{i,0\},h\_\{i\}\)=\\mathcal\{J\}\(x,y\_\{i,0\}\)\.We then condition the same policy on the feedback to generate teacher rollouts,

yi,1∼πθ\(⋅∣x,hi\),ri,1=R\(x,yi,1\)\.y\_\{i,1\}\\sim\\pi\_\{\\theta\}\(\\cdot\\mid x,h\_\{i\}\),\\qquad r\_\{i,1\}=R\(x,y\_\{i,1\}\)\.
For each prompt, we form a joint group𝒢\(x\)=\{yi,0,yi,1\}i=1G,\\mathcal\{G\}\(x\)=\\\{y\_\{i,0\},y\_\{i,1\}\\\}\_\{i=1\}^\{G\},and optimize a clipped GRPO loss over both student and teacher rollouts with group\-relative advantages from task rewards\. This lets the base policy absorb improvements induced by verbal feedback\. We further add an auxiliary GRPO loss on the feedback\-conditioned rollouts alone, with advantages normalized within\{ri,1\}i=1G\\\{r\_\{i,1\}\\\}\_\{i=1\}^\{G\}:

ℒRLVF=ℒgroup\(\{yi,0,yi,1\}i=1G\)\+ℒfb\(\{yi,1\}i=1G\)\.\\mathcal\{L\}\_\{RLVF\}=\\mathcal\{L\}\_\{\\mathrm\{group\}\}\(\\\{y\_\{i,0\},y\_\{i,1\}\\\}\_\{i=1\}^\{G\}\)\+\\mathcal\{L\}\_\{\\mathrm\{fb\}\}\(\\\{y\_\{i,1\}\\\}\_\{i=1\}^\{G\}\)\.

#### Expert Distillation\.

We empirically find that directly mixing all tasks into a single RL run is suboptimal: some tasks improve slowly or plateau early under joint training\. Simple model merging, such as averaging task\-specialized weights, also does not reliably preserve the gains of individual experts\. We therefore useexpert distillationto consolidate task\-specific RL experts into one model\.

For each taskmm, letπθm\\pi\_\{\\theta\_\{m\}\}be its RL expert\. Given a training promptxx, we sampleGGcandidate responses from this expert, score them with the task reward or judge, and keep the top\-KKresponses:

y1,…,yG∼πθm\(⋅∣x\),𝒮m\(x\)=TopKy∈\{y1,…,yG\}Rm\(x,y\)\.y\_\{1\},\\ldots,y\_\{G\}\\sim\\pi\_\{\\theta\_\{m\}\}\(\\cdot\\mid x\),\\qquad\\mathcal\{S\}\_\{m\}\(x\)=\\mathrm\{TopK\}\_\{y\\in\\\{y\_\{1\},\\ldots,y\_\{G\}\\\}\}R\_\{m\}\(x,y\)\.We collect the selected pairs from all tasks into a distillation dataset

𝒟distill=\{\(x,y\):y∈𝒮m\(x\),x∈𝒟m,m∈ℳ\}\.\\mathcal\{D\}\_\{\\mathrm\{distill\}\}=\\\{\(x,y\):y\\in\\mathcal\{S\}\_\{m\}\(x\),x\\in\\mathcal\{D\}\_\{m\},m\\in\\mathcal\{M\}\\\}\.Finally, we train a single model on𝒟distill\\mathcal\{D\}\_\{\\mathrm\{distill\}\}with the standard next\-token cross\-entropy loss\. This stage preserves most task\-specific RL gains while producing one general model\.

### 5\.2Benchmark Results

Table 2:Main Results\.We report the primary metric for each benchmark \(higher is better\)\. \*Others refers to the best result by other human\-simulation models, including HumanLM\-8B\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\), Sotopia\-RL\-7B\(Yu et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib60)\), UserLM\-8B\(Naous et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib34)\), Coser\-8B\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\)\.Basedenotes the base model, Qwen3\-8B\-Base\.Boldindicates the best result in each row; ties are bolded and counted for all tied models\.Avgis the unweighted mean over the 23 benchmarks\.DimBenchmarkGPT5\.5Gemini3\.1 ProClaudeOpus 4\.7Qwen3\.6 PlusOthers\*Qwen38B InstBase8B𝒪\\mathcal\{O\}sim8B\-Mid𝒪\\mathcal\{O\}sim8BCONVUserLLM65\.367\.757\.672\.144\.646\.031\.049\.590\.1MirrorBench56\.748\.363\.748\.045\.454\.013\.949\.168\.3Humanual\-Chat28\.221\.022\.622\.225\.824\.712\.07\.828\.2SimArena\-Doc83\.483\.083\.582\.483\.583\.679\.680\.384\.1SSSotopia\-Hard31\.927\.832\.428\.331\.727\.721\.445\.649\.2COGFantom93\.093\.080\.089\.070\.023\.023\.062\.080\.0Hitom82\.086\.093\.073\.056\.062\.012\.054\.079\.0Paratomi99\.097\.090\.094\.075\.067\.019\.072\.083\.0Social\-R169\.079\.067\.067\.047\.054\.037\.042\.060\.0ROLECoser66\.262\.166\.555\.930\.343\.56\.124\.862\.6Lifechoices91\.084\.092\.079\.067\.070\.032\.058\.082\.0Twinvoice74\.086\.083\.071\.040\.042\.019\.025\.068\.0BehaviorChain95\.092\.096\.085\.036\.041\.018\.042\.094\.0SimArena\-Math68\.571\.568\.770\.970\.568\.966\.268\.170\.7Mistakes72\.073\.074\.067\.056\.027\.024\.018\.059\.0Humanual\-Email50\.146\.950\.447\.942\.843\.726\.422\.351\.4Humanual\-News40\.242\.341\.341\.833\.132\.512\.715\.142\.7Humanual\-Politics42\.032\.543\.531\.634\.233\.217\.815\.441\.9EVALAlignX71\.273\.471\.669\.866\.868\.649\.053\.672\.6Humanllm45\.746\.944\.242\.735\.234\.112\.116\.539\.1Socsci21077\.278\.077\.274\.575\.273\.646\.668\.175\.1Humanual\-Book57\.662\.461\.458\.450\.253\.621\.538\.863\.2Humanual\-Opinion39\.836\.046\.234\.237\.437\.218\.217\.042\.0Avg65\.264\.865\.561\.150\.248\.326\.941\.164\.6\#Best377000008![Refer to caption](https://arxiv.org/html/2606.14199v1/figures/rl_curves.png)Figure 7:RL training dynamics\.Task\-specific RL experts show consistent improvement across 23Soultasks, often reaching or surpassing frontier\-model baselines\. Dashed lines denote GPT 5\.5, Claude Opus 4\.7, and the distilled𝒪\\mathcal\{O\}sim8B; the distilled model recovers much of the RL gain in a single unified model while leaving a gap to the best per\-task experts\.We evaluate all models on held\-out slices of eachSoul\-Index task, capped at 100 instances per task \(500 for HumanLLM\), with generative tasks decoded at temperature 0\.7 \([AppendixG](https://arxiv.org/html/2606.14199#A7)\)\. Judge\-based tasks are scored with the same judge configuration used as the RL environment \(gpt\-5\-nano by default;[AppendixD](https://arxiv.org/html/2606.14199#A4)\)\.

#### Main results\.

Table[2](https://arxiv.org/html/2606.14199#S5.T2)shows that𝒪\\mathcal\{O\}sim8B reaches frontier\-level performanceacross human\-simulation benchmarks\. Averaged over the 23 benchmarks,𝒪\\mathcal\{O\}sim8B scores 64\.6, comparable to GPT\-5\.5, Gemini 3\.1 Pro, and Claude Opus 4\.7\. Despite having only 8B parameters, it achieves the best or tied\-best result on 8 / 23 benchmarks, more than any individual frontier model\. Its largest gains are on conversational and social\-skill tasks, outperforming the best frontier model on UserLLM by 18\.0 points, MirrorBench by 4\.6, and Sotopia\-Hard by 16\.8\. The stage ablation shows thatthe two stages contribute in different ways: midtraining shifts the model toward a human\-like register and behavioral fit \([Section4](https://arxiv.org/html/2606.14199#S4)\), while post\-training drives most of the benchmark gains \(see[AppendixC](https://arxiv.org/html/2606.14199#A3)for the full stage ablation, including instruct\-initialized variants\)\.

𝒪\\mathcal\{O\}sim8B\-Mid improves over Qwen3\-8B\-Base on 18 benchmarks, raising the average from 26\.9 to 41\.1\. The final𝒪\\mathcal\{O\}sim8B further improves the average to 64\.6, outperforming both the midtrained model and Qwen3\-8B\-Instruct on all 23 benchmarks \(full results, including 4B and instruct\-initialized variants, in Table[3](https://arxiv.org/html/2606.14199#A1.T3)\)\. Post\-training brings the largest gains on role\-playing tasks \(\+31\.5 over𝒪\\mathcal\{O\}sim8B\-Mid on average\), followed by conversational \(\+21\.0\) and evaluation \(\+19\.6\) tasks\.

Compared with prior specialized human\-simulation models,𝒪\\mathcal\{O\}sim8B is also consistently stronger\. The “Others” column reports the best specialized model per benchmark, yet our model outperforms it on 22 of 23 benchmarks, with an average gain of 14\.5 points\. This suggests that our unified training recipe generalizes better than narrowly specialized models\. Category\-level results show both strengths and limitations\.𝒪\\mathcal\{O\}sim8B is strongest on conversational simulation and social skills, and remains competitive on role\-playing and evaluation tasks\. However, it still lags behind the strongest frontier models on cognitive and theory\-of\-mind benchmarks such as Paratomi and Social\-R1, suggesting that further reasoning\-oriented training may be needed\.

#### RL dynamics\.

Figure[7](https://arxiv.org/html/2606.14199#S5.F7)shows the training trajectories of task\-specific RL experts on 23 tasks\. Overall, RL yieldsconsistent gains across tasks: the average score increases rapidly at the beginning and then steadily saturates, indicating that theSoulrewards provide effective optimization signals\. On many tasks, the final RL experts reach or surpass frontier\-model baselines, including UserLLM, Sotopia\-Hard, BehaviorChain, Paratomi, and Humanual\-Book\. This shows that small task\-specialized experts can achieve frontier\-level behavioral performance when optimized with task\-specific feedback\. The figure also highlights the effect and limitation of expert distillation\. The distilled𝒪\\mathcal\{O\}sim8B is generally below the best per\-task expert, suggesting that merging many specialized behaviors into a single model remains challenging\. Nevertheless, it stays close to the experts on many tasks and remains competitive with frontier models, showing that distillation transfers a substantial portion of the RL gains into one unified 8B model\.

#### Reward hacking analysis\.

Because behavioral rewards are judge\-based and often non\-verifiable, we monitor RL training using task rewards together with auxiliary statistics of the model outputs, including hacking rate and response length\. Figure[8](https://arxiv.org/html/2606.14199#S5.F8)shows three failure modes and our fixes\.

We observe two main forms of reward hacking in our RL experiments\. First, in Sotopia, the model can exploit the multi\-dimensional LLM judge by inserting evaluation\-like statements into the dialogue, such as explicitly claiming that the relationship score should be perfect, instead of improving the underlying social interaction\. To mitigate this shortcut, we add an LLM\-based hacking detector that identifies such judge\-targeting behaviors and applies a penalty during training\. This reduces the detected hacking rate from roughly 20%–25% to near 0%\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/figures/reward_hack.png)Figure 8:Reward hacking analysis\.We monitor hacking rate and response length during RL\. Our fixes suppress judge manipulation in Sotopia and prevent short\-response collapse in Humanual and Coser, yielding healthier optimization dynamics\.Second, in Humanual and Coser, error\-counting rewards introduce a different shortcut\. Because these judges subtract points for detected mistakes, the model can increase reward by producing short, generic responses that contain fewer checkable claims, rather than by better matching human behavior\. In Humanual, this leads to collapsed replies such as “I’m okay” or “I’m fine”; in Coser, it leads to strong response\-length distortion\. We therefore replace error\-counting with rubric\-based scoring\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\), and, for Humanual, further add length\-matching and lexical\-overlap rewards against ground\-truth responses\. After these fixes, the model no longer collapses to short generic replies, and the response\-length distribution becomes substantially more realistic\.

Overall, these results show thatmonitoring behavioral statistics is essential for RL on human\-simulation tasks: judge\-based rewards may otherwise favor superficial shortcuts over genuine behavioral fidelity\. Ablation studies on the dataset and RL algorithm are in Appendix[C](https://arxiv.org/html/2606.14199#A3)\.

### 5\.3Out\-of\-Distribution Evaluation:𝒪\\mathcal\{O\}simas a User Simulator

To test whether𝒪\\mathcal\{O\}simtransfers to concrete interactive uses outside the trained benchmark tasks, we stress\-test it on*user simulation for agent evaluation*inτ\\tau\-bench\(Yao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib59)\)\.

#### Setup\.

We evaluate onτ\\tau\-USI\(Zhou et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib68)\), a benchmark that scores how human\-like a user simulator is when it stands in for a real person interacting with a tool\-use agent\. A simulator plays the customer side of 165τ\\tau\-bench retail and airline tasks\(Yao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib59)\)against a*fixed*GPT\-5\.2 agent, and the resulting conversations are compared to those of real humans performing the same tasks \(three independent annotation batches\)\. The composite*User Simulation Index*\(USI\) averages six components: four behavioral\-alignment scores measuring whether the simulator opens conversations \(Conv\), volunteers and withholds information \(Info\), asks for clarification \(Clarif\), and reacts to the agent \(React\) the way humans do, each scored as a Sørensen–Dice overlap with the human annotations; a post\-hoc survey\-agreement score \(Eval\); and a task\-success calibration term,1−ECE1\-\\text\{ECE\}, that checks whether the agent succeeds at the*same rate*with the simulator as with humans\.

While user simulation as a capability is related to our training axes, this domain, i\.e\., tool\-grounded customer\-service dialogue, is disjoint from𝒪\\mathcal\{O\}dysSim’s training data, so the evaluation is out\-of\-distribution at the domain level\. We plug𝒪\\mathcal\{O\}simin as the user side directly in its native chat format, with no task\-specific adapter, prompt engineering, or fine\-tuning\. We compare against representative models from the 31 simulators benchmarked byZhou et al\. \([2026b](https://arxiv.org/html/2606.14199#bib.bib68)\), spanning frontier LLMs and prior specialized behavioral simulators\. Please see more details in Appendix[A\.1](https://arxiv.org/html/2606.14199#A1.SS1)\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x7.png)Figure 9:Out\-of\-distribution user simulation onτ\\tau\-USI\(Zhou et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib68)\)\. Each simulator interacts with a tool\-use agent through 165τ\\tau\-bench retail/airline tasks\(Yao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib59)\); bars show alignment to human annotators on the four behavioral dimensions and the composite*USI*, averaged over three annotation batches, with the dashed line marking the human upper bound per metric\. We compare𝒪\\mathcal\{O\}sim\-8B and its midtrained checkpoint \(𝒪\\mathcal\{O\}sim\-8B\-Mid\) against𝒪\\mathcal\{O\}dysSim’s off\-the\-shelf base model \(Qwen3\-8B\-Instruct\), the strongest LLM simulator \(DeepSeek\-V3\.1\), a frontier GPT \(GPT\-5\.1\), and a representative specialized peer \(UserLM\-8B\); full results for all simulators are in[Table4](https://arxiv.org/html/2606.14199#A1.T4)\.
#### Results\.

[Figure9](https://arxiv.org/html/2606.14199#S5.F9)shows that𝒪\\mathcal\{O\}simtransfers strongly to this unseen interactive task, with gains concentrated in the dimensions that most directly measure human behavior\. Specifically,𝒪\\mathcal\{O\}sim\-8B achieves the strongest reaction alignment among all evaluated simulators, nearly matching real users \(React93\.293\.2vs\. human93\.593\.5\)\.𝒪\\mathcal\{O\}dysSimvariants also lead on information alignment \(𝒪\\mathcal\{O\}sim\-4B\-Mid, Info91\.591\.5\), indicating that the learned behavioral prior transfers to how users reveal, withhold, and respond to information in an unseen domain\. The composite USI score gives a similar picture with one caveat:𝒪\\mathcal\{O\}simis competitive with the leading simulator, DeepSeek\-V3\.1, whose small aggregate edge comes from survey\-agreement and calibration components \(Eval and ECE\), not stronger turn\-by\-turn behavioral alignment\. This pattern reinforces the paper’s central premise\. Some of the most heavily assistant\-tuned frontier models, including Gemini\-3\.1\-Pro and Claude\-Opus\-4, are among the weakest user simulators on behavioral alignment, consistent with their helpful, agreeable register diverging from how real users behave\.

As an orthogonal check, we also measure raw text human\-likeness with HumT\(Cheng et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib8)\), following the probe in[Section4](https://arxiv.org/html/2606.14199#S4); full scores are in[Table5](https://arxiv.org/html/2606.14199#A1.T5)\.𝒪\\mathcal\{O\}sim\-8B scores11\.911\.9, substantially above both the most human\-like frontier model in this comparison \(Gemini\-3\.1\-Pro,7\.07\.0\) and the off\-the\-shelf base \(Qwen3\-8B\-Instruct,4\.24\.2\)\. Together withτ\\tau\-USI, this suggests that𝒪\\mathcal\{O\}dysSimshifts the model away from a helpful\-assistant register and toward more diverse, realistic human behavior simulation\.

## 6Conclusion

We presented𝒪\\mathcal\{O\}dysSim, an end\-to\-end investigation of behavioral foundation models spanning a unified taxonomy, a socially grounded corpus \(𝒪\\mathcal\{O\}dysSimcorpus\), a 23\-task evaluation suite \(Soul\-Index\), and a training recipe that combines midtraining, task\-specific RL, and expert distillation\. The resulting models reach frontier\-level average performance onSoul\-Index and outperform prior open behavioral\-simulation baselines\. The broader implication is that behavioral foundation models need more than stronger prompting or generic instruction tuning\. Our results point to a pipeline in which broad, socially grounded midtraining first moves the model away from the homogeneous assistant register, and task\-specific RL then sharpens behavior across the concrete settings represented bySoul\-Index\. The reward\-hacking analysis adds an important caveat: behavioral rewards must be monitored carefully, since higher judge scores can reflect shortcuts such as judge manipulation or response\-length collapse rather than genuine behavioral fidelity\.

Our findings suggest that building general human simulators is less a matter of making a model more helpful and more a matter of constructing the right behavioral substrate: diverse social data, axis\-aware evaluation, task\-grounded optimization, and safeguards for reward design\. We release the corpus,Soul\-Index, recipes, and checkpoints to make this pipeline reproducible and to support future work on multimodal, multilingual, and population\-aware behavioral simulation\.

## Ethics and Broader Impact

#### Dual\-use considerations\.

Behavioral foundation models are dual\-use: the same fidelity that makes them useful for evaluating agents, studying social interaction, and stress\-testing dialogue systems could also make automated interaction feel more human than users expect\. Misuse cases include impersonating real people or demographic groups, generating tailored persuasive or manipulative dialogue at scale, creating synthetic participants that are mistaken for human subjects, or using simulated users to optimize systems for engagement rather than welfare\. We therefore frame𝒪\\mathcal\{O\}simas a research artifact for controlled evaluation and analysis, not as an autonomous human stand\-in or a tool for representing any specific person or population\. Our release mitigates these risks through research\-oriented licensing and model\-card guidance, documentation of intended and discouraged uses, PII filtering and deduplication in the corpus, and explicit reminders that𝒪\\mathcal\{O\}dysSimoutputs should be labeled as synthetic and treated as behavioral baselines rather than real human evidence\.

#### Data privacy and consent\.

All component datasets are drawn from publicly released sources whose licenses permit derivative research use; we re\-distribute only what those licenses allow\. Our preprocessing pipeline removes conversations with personally identifiable content \([SectionE\.5](https://arxiv.org/html/2606.14199#A5.SS5)\), and we deduplicate aggressively at the conversation level with MinHash to limit memorization risk\. For sources collected from human participants \(PRISM, SocSci210,τ\\tau\-USI\), we rely on the original studies’ IRB approvals and participant consent terms; we do not re\-collect or re\-link participant identities\.

#### Limitations of behavioral fidelity\.

Better simulation of*average*human behavior does not imply better simulation of any specific population, and our corpus over\-represents English\-language internet sources and Western cultural contexts\. Practitioners using𝒪\\mathcal\{O\}dysSimfor downstream evaluation should treat its outputs as a behavioral baseline rather than a representative human sample\.

## References

- Argyle et al\. \(2023\)Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate\.Out of one, many: Using language models to simulate human samples\.*Political Analysis*, 2023\.URL[https://arxiv\.org/abs/2209\.06899](https://arxiv.org/abs/2209.06899)\.
- Bai et al\. \(2022\)Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El\-Showk, Nelson Elhage, Zac Hatfield\-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan\.Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022\.URL[https://arxiv\.org/abs/2204\.05862](https://arxiv.org/abs/2204.05862)\.
- Baron\-Cohen et al\. \(1985\)Simon Baron\-Cohen, Alan M\. Leslie, and Uta Frith\.Does the autistic child have a “theory of mind”?*Cognition*, 21\(1\):37–46, 1985\.URL[https://doi\.org/10\.1016/0010\-0277\(85\)90022\-8](https://doi.org/10.1016/0010-0277(85)90022-8)\.
- Binz et al\. \(2025\)Marcel Binz, Elif Akata, Matthias Bethge, Franziska Brändle, Fred Callaway, Julian Coda\-Forno, Peter Dayan, Can Demircan, Maria K\. Eckstein, Noémi Éltető, Thomas L\. Griffiths, Susanne Haridi, Akshay K\. Jagadish, Li Ji\-An, Alexander Kipnis, Sreejan Kumar, Tobias Ludwig, Marvin Mathony, Marcelo Mattar, Alireza Modirshanechi, Surabhi S\. Nath, Joshua C\. Peterson, Milena Rmus, Evan M\. Russek, Tankred Saanum, Johannes A\. Schubert, Luca M\. Schulze Buschoff, Nishad Singhi, Xin Sui, Mirko Thalmann, Fabian Theis, Vuong Truong, Vishaal Udandarao, Konstantinos Voudouris, Robert Wilson, Kristin Witte, Shuchen Wu, Dirk Wulff, Huadong Xiong, and Eric Schulz\.Centaur: a foundation model of human cognition, 2025\.URL[https://arxiv\.org/abs/2410\.20268](https://arxiv.org/abs/2410.20268)\.
- Chang et al\. \(2020\)Jonathan P\. Chang, Caleb Chiam, Liye Fu, Andrew Z\. Wang, Justine Zhang, and Cristian Danescu\-Niculescu\-Mizil\.Convokit: A toolkit for the analysis of conversations\.In*Proceedings of SIGDIAL*, 2020\.URL[https://arxiv\.org/abs/2005\.04246](https://arxiv.org/abs/2005.04246)\.
- Chawla et al\. \(2021\)Kushal Chawla, Jaysa Ramirez, Rene Clever, Gale Lucas, Jonathan May, and Jonathan Gratch\.Casino: A corpus of campsite negotiation dialogues for automatic negotiation systems\.In*NAACL*, 2021\.URL[https://arxiv\.org/abs/2103\.15721](https://arxiv.org/abs/2103.15721)\.
- Chen et al\. \(2025\)Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey\.Persona vectors: Monitoring and controlling character traits in language models, 2025\.URL[https://arxiv\.org/abs/2507\.21509](https://arxiv.org/abs/2507.21509)\.
- Cheng et al\. \(2025\)Myra Cheng, Sunny Yu, and Dan Jurafsky\.HumT DumT: Measuring and controlling human\-like language in LLMs\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2025\.URL[https://aclanthology\.org/2025\.acl\-long\.1261/](https://aclanthology.org/2025.acl-long.1261/)\.
- Cialdini \(2007\)Robert B\. Cialdini\.*Influence: The Psychology of Persuasion*\.Harper Collins, revised edition, 2007\.URL[https://www\.harpercollins\.com/products/influence\-the\-psychology\-of\-persuasion\-revised\-edition\-robert\-b\-cialdini](https://www.harpercollins.com/products/influence-the-psychology-of-persuasion-revised-edition-robert-b-cialdini)\.
- Danescu\-Niculescu\-Mizil & Lee \(2011\)Cristian Danescu\-Niculescu\-Mizil and Lillian Lee\.Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs\.In*Workshop on Cognitive Modeling and Computational Linguistics*, 2011\.URL[https://arxiv\.org/abs/1106\.3077](https://arxiv.org/abs/1106.3077)\.
- Dou et al\. \(2025\)Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao\.Simulatorarena: Are user simulators reliable proxies for multi\-turn evaluation of ai assistants?In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2025\.URL[https://arxiv\.org/abs/2510\.05444](https://arxiv.org/abs/2510.05444)\.
- Du et al\. \(2025\)Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, and Yiqun Liu\.Twinvoice: A multi\-dimensional benchmark towards digital twins via llm persona simulation, 2025\.URL[https://arxiv\.org/abs/2510\.25536](https://arxiv.org/abs/2510.25536)\.
- Emelin et al\. \(2021\)Denis Emelin, Ronan Le Bras, Jena D Hwang, Maxwell Forbes, and Yejin Choi\.Moral stories: Situated reasoning about norms, intents, actions, and their consequences\.*arXiv preprint arXiv:2012\.15738*, 2021\.URL[https://arxiv\.org/abs/2012\.15738](https://arxiv.org/abs/2012.15738)\.
- Gururangan et al\. \(2020\)Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A\. Smith\.Don’t stop pretraining: Adapt language models to domains and tasks\.In*ACL*, 2020\.URL[https://arxiv\.org/abs/2004\.10964](https://arxiv.org/abs/2004.10964)\.
- Hathidara et al\. \(2026\)Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, and Anil Babu Ankisettipalli\.Mirrorbench: A benchmark to evaluate conversational user\-proxy agents for human\-likeness, 2026\.URL[https://arxiv\.org/abs/2601\.08118](https://arxiv.org/abs/2601.08118)\.
- He et al\. \(2023\)Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng\.Hi\-tom: A benchmark for evaluating higher\-order theory of mind reasoning in large language models\.In*Findings of EMNLP*, 2023\.URL[https://arxiv\.org/abs/2310\.16755](https://arxiv.org/abs/2310.16755)\.
- Hübotter et al\. \(2026\)Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause\.Reinforcement learning via self\-distillation\.*arXiv preprint arXiv:2601\.20802*, 2026\.URL[https://arxiv\.org/abs/2601\.20802](https://arxiv.org/abs/2601.20802)\.
- Hymes \(1972\)Dell Hymes\.On communicative competence\.In J\. B\. Pride and J\. Holmes \(eds\.\),*Sociolinguistics: Selected Readings*, pp\. 269–293\. Penguin, 1972\.URL[https://wwnorton\.com/books/9780393092264](https://wwnorton.com/books/9780393092264)\.
- Jiang et al\. \(2026\)Bowen Jiang, Taiwei Shi, Ryo Kamoi, Yuan Yuan, Camillo J\. Taylor, Longqi Yang, Pei Zhou, and Sihao Chen\.One model, all roles: Multi\-turn, multi\-agent self\-play reinforcement learning for conversational social intelligence, 2026\.URL[https://arxiv\.org/abs/2602\.03109](https://arxiv.org/abs/2602.03109)\.
- Jiang et al\. \(2025\)Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, and Yejin Choi\.Artificial hivemind: The open\-ended homogeneity of language models \(and beyond\)\.*arXiv preprint arXiv:2510\.22954*, 2025\.URL[https://arxiv\.org/abs/2510\.22954](https://arxiv.org/abs/2510.22954)\.
- Kim et al\. \(2023\)Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap\.Fantom: A benchmark for stress\-testing machine theory of mind in interactions\.In*EMNLP*, 2023\.URL[https://arxiv\.org/abs/2310\.15421](https://arxiv.org/abs/2310.15421)\.
- Kirk et al\. \(2024\)Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A\. Hale\.The prism alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models\.*Advances in Neural Information Processing Systems*, 37:105236–105344, 2024\.URL[https://arxiv\.org/abs/2404\.16019](https://arxiv.org/abs/2404.16019)\.
- Kolluri et al\. \(2025\)Akaash Kolluri, Shengguang Wu, Joon Sung Park, and Michael S Bernstein\.Finetuning llms for human behavior prediction in social science experiments\.In*EMNLP*, 2025\.URL[https://arxiv\.org/abs/2509\.05830](https://arxiv.org/abs/2509.05830)\.
- Kyung et al\. \(2025\)Daeun Kyung, Hyunseung Chung, Seongsu Bae, Jiho Kim, Jae Ho Sohn, Taerim Kim, Soo Kyung Kim, and Edward Choi\.Patientsim: A persona\-driven simulator for realistic doctor\-patient interactions, 2025\.URL[https://arxiv\.org/abs/2505\.17818](https://arxiv.org/abs/2505.17818)\.
- Köpf et al\. \(2023\)Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi\-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick\.Openassistant conversations – democratizing large language model alignment, 2023\.URL[https://arxiv\.org/abs/2304\.07327](https://arxiv.org/abs/2304.07327)\.
- Le et al\. \(2019\)Matthew Le, Y\-Lan Boureau, and Maximilian Nickel\.Revisiting the evaluation of theory of mind through question answering\.In*EMNLP*, 2019\.URL[https://aclanthology\.org/D19\-1598/](https://aclanthology.org/D19-1598/)\.
- Lei et al\. \(2026\)Yuxuan Lei, Tianfu Wang, Jianxun Lian, Zhengyu Hu, Defu Lian, and Xing Xie\.Humanllm: Towards personalized understanding and simulation of human nature, 2026\.URL[https://arxiv\.org/abs/2601\.15793](https://arxiv.org/abs/2601.15793)\.
- Li et al\. \(2025a\)Jia\-Nan Li, Jian Guan, Songhao Wu, Wei Wu, and Rui Yan\.From 1,000,000 users to every user: Scaling up personalized preference for user\-level alignment, 2025a\.URL[https://arxiv\.org/abs/2503\.15463](https://arxiv.org/abs/2503.15463)\.
- Li et al\. \(2025b\)Wenkai Li, Jiarui Liu, Andy Liu, Xuhui Zhou, Mona T\. Diab, and Maarten Sap\.BIG5\-CHAT: Shaping LLM personalities through training on human\-grounded data\.In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar \(eds\.\),*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 20434–20471, Vienna, Austria, July 2025b\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-251\-0\.doi:10\.18653/v1/2025\.acl\-long\.999\.URL[https://aclanthology\.org/2025\.acl\-long\.999/](https://aclanthology.org/2025.acl-long.999/)\.
- Li et al\. \(2017\)Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu\.Dailydialog: A manually labelled multi\-turn dialogue dataset\.In*IJCNLP*, 2017\.URL[https://arxiv\.org/abs/1710\.03957](https://arxiv.org/abs/1710.03957)\.
- Liu et al\. \(2026\)Emmy Liu, Graham Neubig, and Chenyan Xiong\.Midtraining bridges pretraining and posttraining distributions, 2026\.URL[https://arxiv\.org/abs/2510\.14865](https://arxiv.org/abs/2510.14865)\.
- Liu et al\. \(2021\)Siyang Liu, Chujie Zheng, Ori Dember, Sahand Sabour, Yu Li, Dianbo Yu, Yun Jiang, and Minlie Huang\.Towards emotional support dialog systems\.In*ACL*, 2021\.URL[https://arxiv\.org/abs/2106\.01144](https://arxiv.org/abs/2106.01144)\.
- Mo et al\. \(2025\)Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxiang Zeng\.Mid\-training of large language models: A survey, 2025\.URL[https://arxiv\.org/abs/2510\.06826](https://arxiv.org/abs/2510.06826)\.
- Naous et al\. \(2026\)Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville\.Flipping the dialogue: Training and evaluating user language models, 2026\.URL[https://arxiv\.org/abs/2510\.06552](https://arxiv.org/abs/2510.06552)\.
- OpenAI \(2025\)OpenAI\.Introducing GPT\-5\.5\.[https://openai\.com/index/introducing\-gpt\-5\-5/](https://openai.com/index/introducing-gpt-5-5/), 2025\.Accessed 2026\-05\-06\.
- Ouyang et al\. \(2022\)Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\.Training language models to follow instructions with human feedback\.*Advances in Neural Information Processing Systems*, 2022\.URL[https://arxiv\.org/abs/2203\.02155](https://arxiv.org/abs/2203.02155)\.
- Park et al\. \(2023\)Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein\.Generative agents: Interactive simulacra of human behavior\.In*UIST*, 2023\.URL[https://arxiv\.org/abs/2304\.03442](https://arxiv.org/abs/2304.03442)\.
- Qwen Team \(2025\)Qwen Team\.Qwen3 technical report, 2025\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Rashkin et al\. \(2019\)Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y\-Lan Boureau\.Towards empathetic open\-domain conversation models: A new benchmark and dataset\.In*ACL*, 2019\.URL[https://arxiv\.org/abs/1811\.00207](https://arxiv.org/abs/1811.00207)\.
- Ross & Andreas \(2025a\)Alexis Ross and Jacob Andreas\.Learning to make mistakes: Modeling incorrect student thinking and key errors, 2025a\.URL[https://arxiv\.org/abs/2510\.11502](https://arxiv.org/abs/2510.11502)\.
- Ross & Andreas \(2025b\)Alexis Ross and Jacob Andreas\.Learning to make mistakes: Modeling incorrect student thinking and key errors, 2025b\.URL[https://arxiv\.org/abs/2510\.11502](https://arxiv.org/abs/2510.11502)\.
- Rozière et al\. \(2024\)Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve\.Code llama: Open foundation models for code, 2024\.URL[https://arxiv\.org/abs/2308\.12950](https://arxiv.org/abs/2308.12950)\.
- Sap et al\. \(2019\)Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi\.Social iqa: Commonsense reasoning about social interactions\.In*EMNLP*, 2019\.URL[https://arxiv\.org/abs/1904\.09728](https://arxiv.org/abs/1904.09728)\.
- Sclar et al\. \(2023\)Melanie Sclar, Sachin Kumar, Peter West, Alane Suhr, Yejin Choi, and Yulia Tsvetkov\.Minding language models’ \(lack of\) theory of mind: A plug\-and\-play multi\-character belief tracker, 2023\.URL[https://arxiv\.org/abs/2306\.00924](https://arxiv.org/abs/2306.00924)\.
- Shi et al\. \(2026\)Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, and Jieyu Zhao\.Experiential reinforcement learning\.*arXiv preprint arXiv:2602\.13949*, 2026\.URL[https://arxiv\.org/abs/2602\.13949](https://arxiv.org/abs/2602.13949)\.
- Sileo \(2023\)Damien Sileo\.Mindgames: Targeting theory of mind in large language models with dynamic epistemic modal logic\.*arXiv preprint arXiv:2305\.05110*, 2023\.URL[https://arxiv\.org/abs/2305\.03353](https://arxiv.org/abs/2305.03353)\.
- Song et al\. \(2026\)Yuda Song, Lili Chen, Fahim Tajwar, Rémi Munos, Deepak Pathak, J\. Andrew Bagnell, Aarti Singh, and Andrea Zanette\.Expanding the capabilities of reinforcement learning via text feedback\.*arXiv preprint arXiv:2602\.02482*, 2026\.URL[https://arxiv\.org/abs/2602\.02482](https://arxiv.org/abs/2602.02482)\.
- StepFun AI \(2025\)StepFun AI\.Step\-3\.5\-flash\-sft\.Hugging Face Datasets, 2025\.URL[https://huggingface\.co/datasets/stepfun\-ai/Step\-3\.5\-Flash\-SFT](https://huggingface.co/datasets/stepfun-ai/Step-3.5-Flash-SFT)\.
- Sun et al\. \(2026\)Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, and Maarten Sap\.Reinforcing human behavior simulation via verbal feedback, 2026\.URL[https://arxiv\.org/abs/2605\.20506](https://arxiv.org/abs/2605.20506)\.
- Tirinzoni et al\. \(2024\)Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, and Matteo Pirotta\.Zero\-shot whole\-body humanoid control via behavioral foundation models\.[https://github\.com/facebookresearch/metamotivo](https://github.com/facebookresearch/metamotivo), 2024\.
- Verga et al\. \(2024\)Pat Verga, Sebastian Hofstätter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis\.Replacing judges with juries: Evaluating llm generations with a panel of diverse models\.*arXiv preprint arXiv:2404\.18796*, 2024\.URL[https://arxiv\.org/abs/2404\.18796](https://arxiv.org/abs/2404.18796)\.
- Wang et al\. \(2024\)Ruiyi Wang, Haofei Yu, Wenxin Zhang, Zhengyang Qi, Maarten Sap, Graham Neubig, Yonatan Bisk, and Hao Zhu\.Sotopia\-π\\pi: Interactive learning of socially intelligent language agents, 2024\.URL[https://arxiv\.org/abs/2403\.08715](https://arxiv.org/abs/2403.08715)\.
- Wang et al\. \(2026\)Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, and Yanghua Xiao\.Coser: A comprehensive literary dataset and framework for training and evaluating llm role\-playing and persona simulation, 2026\.URL[https://arxiv\.org/abs/2502\.09082](https://arxiv.org/abs/2502.09082)\.
- Wang et al\. \(2019\)Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu\.Persuasion for good: Towards a personalized persuasive dialogue system for social good\.In*ACL*, 2019\.URL[https://arxiv\.org/abs/1906\.06725](https://arxiv.org/abs/1906.06725)\.
- Wu et al\. \(2026a\)Jincenzi Wu, Yuxuan Lei, Jianxun Lian, Yitian Huang, Lexin Zhou, Haotian Li, Xing Xie, and Helen Meng\.Social\-r1: Towards human\-like social reasoning in llms, 2026a\.URL[https://arxiv\.org/abs/2603\.09249](https://arxiv.org/abs/2603.09249)\.
- Wu et al\. \(2026b\)Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He\-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou\.Humanlm: Simulating users with state alignment beats response imitation, 2026b\.URL[https://arxiv\.org/abs/2603\.03303](https://arxiv.org/abs/2603.03303)\.
- Xie et al\. \(2025\)Yutong Xie, Zhuoheng Li, Xiyuan Wang, Yijun Pan, Qijia Liu, Xingzhi Cui, Kuang\-Yu Lo, Ruoyi Gao, Xingjian Zhang, Jin Huang, Walter Yuan, Matthew O\. Jackson, and Qiaozhu Mei\.Be\.fm: Open foundation models for human behavior, 2025\.URL[https://arxiv\.org/abs/2505\.23058](https://arxiv.org/abs/2505.23058)\.
- Xu et al\. \(2024\)Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao\.Character is destiny: Can role\-playing language agents make persona\-driven decisions?*arXiv preprint arXiv:2404\.12138*, 2024\.URL[https://arxiv\.org/abs/2404\.12138](https://arxiv.org/abs/2404.12138)\.
- Yao et al\. \(2024\)Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan\.τ\\tau\-bench: A benchmark for tool\-agent\-user interaction in real\-world domains, 2024\.URL[https://arxiv\.org/abs/2406\.12045](https://arxiv.org/abs/2406.12045)\.
- Yu et al\. \(2025\)Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, and Jiaxuan You\.Sotopia\-rl: Reward design for social intelligence, 2025\.URL[https://arxiv\.org/abs/2508\.03905](https://arxiv.org/abs/2508.03905)\.
- Zeng et al\. \(2025\)Weishuai Zeng, Shunlin Lu, Kangning Yin, et al\.Behavior foundation model for humanoid robots, 2025\.URL[https://arxiv\.org/abs/2509\.13780](https://arxiv.org/abs/2509.13780)\.
- Zhao et al\. \(2024\)Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng\.Wildchat: 1m chatgpt interaction logs in the wild, 2024\.URL[https://arxiv\.org/abs/2405\.01470](https://arxiv.org/abs/2405.01470)\.
- Zheng et al\. \(2023\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P\. Xing, Hao Zhang, Joseph E\. Gonzalez, and Ion Stoica\.Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.In*Advances in Neural Information Processing Systems*, volume 36, 2023\.URL[https://arxiv\.org/abs/2306\.05685](https://arxiv.org/abs/2306.05685)\.
- Zheng et al\. \(2024\)Lianmin Zheng, Wei\-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P\. Xing, Joseph E\. Gonzalez, Ion Stoica, and Hao Zhang\.Lmsys\-chat\-1m: A large\-scale real\-world llm conversation dataset, 2024\.URL[https://arxiv\.org/abs/2309\.11998](https://arxiv.org/abs/2309.11998)\.
- Zhou et al\. \(2024a\)Xuhui Zhou, Hyunwoo Kim, Faeze Brahman, Liwei Jiang, Hao Zhu, Ximing Lu, Frank Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, and Maarten Sap\.Haicosystem: An ecosystem for sandboxing safety risks in human\-ai interactions\.*arXiv preprint arXiv:2409\.16427*, 2024a\.URL[https://arxiv\.org/abs/2409\.16427](https://arxiv.org/abs/2409.16427)\.
- Zhou et al\. \(2024b\)Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis\-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap\.Sotopia: Interactive evaluation for social intelligence in language agents\.In*ICLR*, 2024b\.URL[https://arxiv\.org/abs/2310\.11667](https://arxiv.org/abs/2310.11667)\.
- Zhou et al\. \(2026a\)Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, and Maarten Sap\.Mind the sim2real gap in user simulation for agentic tasks, 2026a\.URL[https://arxiv\.org/abs/2603\.11245](https://arxiv.org/abs/2603.11245)\.
- Zhou et al\. \(2026b\)Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, and Maarten Sap\.Mind the sim2real gap in user simulation for agentic tasks, 2026b\.URL[https://arxiv\.org/abs/2603\.11245](https://arxiv.org/abs/2603.11245)\.
- Zhu et al\. \(2023\)Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, Karthik Ganesan, Wei\-Lin Chiang, Jian Zhang, and Jiantao Jiao\.Starling\-7b: Improving llm helpfulness and harmlessness with rlaif\.*arXiv preprint*, 2023\.URL[https://huggingface\.co/datasets/berkeley\-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar)\.

## Appendix AFull Results

Table 3:Full Results\.Per\-task scores and the unweighted 23\-task average \(Avg\) for all evaluated models\.\+ Middenotes midtraining on the𝒪\\mathcal\{O\}dysSimcorpus;\+ Postdenotes task\-specific RL followed by expert distillation\. The headline𝒪\\mathcal\{O\}sim8B of Table[2](https://arxiv.org/html/2606.14199#S5.T2)is Qwen3\-8B\-Base \+ Mid \+ Post, and𝒪\\mathcal\{O\}sim8B\-Mid is Qwen3\-8B\-Base \+ Mid\. Ditto\-v2 8B is the verbal\-feedback RL model ofSun et al\. \([2026](https://arxiv.org/html/2606.14199#bib.bib49)\), post\-trained directly from Qwen3\-8B\-Instruct\.CONVSSCOGROLEEVALModelAvgUserLLM

MirrorBench

Human\-Chat

SimArena\-Doc

Sotopia\-Hard

Fantom

Hitom

Paratomi

Social\-R1

Coser

Lifechoices

Twinvoice

BehaviorChain

SimArena\-Math

Mistakes

Human\-Email

Human\-News

Human\-Politics

AlignX

Humanllm

Socsci210

Human\-Book

Human\-Opinion

Proprietary ModelsGPT 5\.565\.265\.356\.728\.283\.431\.993\.082\.099\.069\.066\.291\.074\.095\.068\.572\.050\.140\.242\.071\.245\.777\.257\.639\.8GPT 5\.4 Nano53\.048\.945\.024\.683\.729\.580\.083\.075\.048\.053\.564\.034\.038\.069\.058\.047\.136\.438\.767\.434\.674\.343\.742\.3GPT 5\.4 Mini58\.252\.555\.626\.783\.328\.584\.078\.082\.058\.058\.872\.044\.072\.067\.457\.050\.339\.841\.768\.639\.975\.255\.646\.9Gemini 3\.1 Pro64\.867\.748\.321\.083\.027\.893\.086\.097\.079\.062\.184\.086\.092\.071\.573\.046\.942\.332\.573\.446\.978\.062\.436\.0Claude Opus 4\.765\.557\.663\.722\.683\.532\.480\.093\.090\.067\.066\.592\.083\.096\.068\.774\.050\.441\.343\.571\.644\.277\.261\.446\.2Qwen 3\.6 Plus61\.172\.148\.022\.282\.428\.389\.073\.094\.067\.055\.979\.071\.085\.070\.967\.047\.941\.831\.669\.842\.774\.558\.434\.2Open\-Source Specialized ModelsUserLM 8B13\.437\.39\.74\.277\.717\.81\.00\.011\.03\.03\.513\.01\.05\.061\.51\.08\.72\.55\.926\.83\.81\.83\.49\.1Coser 8B19\.544\.423\.62\.575\.625\.01\.00\.03\.00\.030\.044\.04\.08\.068\.00\.09\.215\.88\.426\.64\.321\.025\.48\.2HumanLM 8B48\.737\.245\.421\.782\.926\.770\.056\.075\.047\.019\.867\.040\.036\.069\.056\.042\.133\.133\.066\.835\.275\.248\.737\.4SotopiaRL 7B39\.744\.638\.325\.883\.531\.70\.031\.040\.046\.030\.362\.029\.021\.070\.517\.042\.830\.234\.258\.624\.369\.850\.232\.3OursQwen3\-4B\-Base47\.840\.945\.019\.682\.526\.551\.066\.057\.053\.034\.061\.040\.037\.066\.855\.043\.631\.529\.365\.631\.374\.653\.333\.8Qwen3\-4B\-Base \+ Mid39\.459\.641\.611\.080\.643\.965\.052\.070\.036\.015\.570\.028\.035\.069\.723\.022\.616\.113\.052\.66\.536\.839\.717\.5Qwen3\-4B\-Base \+ Post60\.588\.960\.123\.284\.345\.181\.068\.075\.056\.050\.679\.060\.078\.071\.261\.050\.138\.436\.874\.436\.874\.160\.339\.0Qwen3\-4B\-Base \+ Mid \+ Post62\.689\.868\.324\.584\.848\.082\.069\.081\.055\.055\.089\.063\.088\.071\.353\.048\.742\.439\.171\.837\.874\.363\.340\.5Qwen3\-8B\-Base26\.931\.013\.912\.079\.621\.423\.012\.019\.037\.06\.132\.019\.018\.066\.224\.026\.412\.717\.849\.012\.146\.621\.518\.2Qwen3\-8B\-Base \+ Mid41\.149\.549\.17\.880\.345\.662\.054\.072\.042\.024\.858\.025\.042\.068\.118\.022\.315\.115\.453\.616\.568\.138\.817\.0Qwen3\-8B\-Base \+ Post63\.890\.563\.028\.585\.449\.775\.074\.084\.061\.059\.680\.068\.089\.070\.756\.053\.242\.940\.473\.040\.577\.763\.441\.1Qwen3\-8B\-Base \+ Mid \+ Post64\.690\.168\.328\.284\.149\.280\.079\.083\.060\.062\.682\.068\.094\.070\.759\.051\.442\.741\.972\.639\.175\.163\.242\.0Qwen3\-8B\-Inst48\.346\.054\.024\.783\.627\.723\.062\.067\.054\.043\.570\.042\.041\.068\.927\.043\.732\.533\.268\.634\.173\.653\.637\.2Qwen3\-8B\-Inst \+ Mid43\.157\.148\.612\.180\.543\.166\.056\.068\.044\.024\.062\.025\.052\.068\.640\.024\.118\.013\.759\.217\.955\.638\.317\.3Qwen3\-8B\-Inst \+ Post†65\.392\.670\.624\.984\.348\.789\.078\.082\.059\.063\.979\.074\.093\.070\.666\.049\.244\.341\.072\.240\.975\.563\.640\.1Qwen3\-8B\-Inst \+ Post66\.091\.670\.026\.884\.449\.390\.076\.087\.064\.063\.979\.075\.092\.071\.762\.050\.144\.442\.174\.240\.477\.263\.642\.6Qwen3\-8B\-Inst \+ Mid \+ Post65\.793\.972\.828\.985\.148\.490\.079\.089\.060\.064\.173\.070\.091\.070\.861\.051\.743\.341\.273\.841\.577\.864\.241\.1Ditto\-v2 8B\(Sun et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib49)\)66\.592\.772\.027\.485\.149\.495\.082\.091\.061\.065\.383\.064\.093\.070\.765\.049\.443\.641\.673\.642\.175\.165\.241\.6### A\.1Full Out\-of\-Distribution Evaluation Results

[Tables4](https://arxiv.org/html/2606.14199#A1.T4)and[5](https://arxiv.org/html/2606.14199#A1.T5)report the full per\-model results for the two out\-of\-distribution checks in[Section5\.3](https://arxiv.org/html/2606.14199#S5.SS3): interactive user simulation onτ\\tau\-USI and raw text human\-likeness on HumT\.

Table 4:Out\-of\-distribution user simulation onτ\\tau\-USI\(Zhou et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib68)\)\. Each simulator drives a fixed GPT\-5\.2 tool\-use agent through 165τ\\tau\-bench retail/airline tasks\(Yao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib59)\); we report alignment to human annotators on the four behavioral dimensions \(*Conv*ersation,*Info*rmation,*Clarif*ication,*React*ion; Sørensen–Dice×100\\times 100\), post\-hoc survey agreement \(*Eval*\), task\-success calibration error \(*ECE*, lower is better\), and the composite*USI*, averaged over three annotation batches\.Boldmarks the best non\-human entry per column\.𝒪\\mathcal\{O\}dysSimvariants are evaluated zero\-shot;τ\\tau\-bench’s tool\-agent domain is disjoint from theSoul\-Index training corpus\.User SimulatorConvInfoClarifReactEvalECE↓\\downarrowUSIHuman \(upper bound\)87\.497\.988\.093\.597\.40\.08192\.7Frontier LLMsDeepSeek\-V3\.145\.186\.674\.587\.674\.30\.11976\.1GPT\-5\.147\.377\.473\.388\.172\.10\.17273\.5Gemini\-3\.1\-Pro44\.167\.148\.945\.375\.10\.10161\.7Claude\-Opus\-432\.671\.946\.644\.973\.40\.13959\.2Specialized behavioral simulatorsCoSER\-8B\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\)37\.871\.571\.669\.963\.30\.10967\.2UserLM\-8B\(Naous et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib34)\)30\.850\.856\.880\.067\.40\.14062\.0Human\-Like\-7B35\.755\.051\.665\.972\.80\.22059\.8HumanLM\-8B\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)30\.119\.538\.550\.761\.60\.19246\.9Off\-the\-shelf instruct model \(𝒪\\mathcal\{O\}dysSim’s base checkpoint\)Qwen3\-8B\-Instruct50\.574\.380\.280\.372\.20\.29371\.4𝒪\\mathcal\{O\}dysSim\(ours\)𝒪\\mathcal\{O\}sim\-8B46\.488\.279\.793\.270\.00\.25575\.4𝒪\\mathcal\{O\}sim\-8B\-Mid45\.583\.563\.776\.169\.10\.35667\.1𝒪\\mathcal\{O\}sim\-Inst\-8B45\.185\.274\.183\.674\.40\.33971\.4𝒪\\mathcal\{O\}sim\-4B30\.066\.376\.282\.270\.50\.24266\.8𝒪\\mathcal\{O\}sim\-4B\-Mid51\.191\.565\.582\.669\.00\.24272\.6𝒪\\mathcal\{O\}sim\-Inst\-4B37\.873\.462\.662\.172\.10\.23664\.1Table 5:Human\-likeness beyondSoul\-Index \(HumT\(Cheng et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib8)\);×100\\times 100, higher is more human\-like\)\.Per\-text anthropomorphism scalar—the log\-prob ratio of animate vs\. inanimate prefixes under a fixed GPT\-2 backbone—measured on HumT’s held\-out prompts, a signal entirely outside theSoul\-Index training objective\.Othersis the best prior human\-simulation model\.Boldmarks the best entry\.GPT5\.5Gemini3\.1 ProClaudeOpus 4\.7Qwen3\.6 PlusOthers\*Qwen38B InstBase8B𝒪\\mathcal\{O\}sim8B\-Mid𝒪\\mathcal\{O\}sim8BHumT4\.67\.03\.73\.82\.94\.22\.113\.411\.9

## Appendix BLimitations and Future Work

Our investigation is text\-only, but human behavior is fundamentally multimodal \(voice, gesture, facial expression\)\. The𝒪\\mathcal\{O\}dysSimcorpus, while diverse, still consists largely of*performed*behavior \(conversations typed for an audience\) rather than naturalistic decision traces, and our evaluation depends on LLM judges that may carry their own biases\. Promising directions include extending to multimodal behavior, scaling to larger base models, and testing whether behavioral midtraining transfers across languages and cultures\.

## Appendix CAblation

Table 6:Ablation over training stages\.Mid\-Traindenotes mid\-stage SFT andPost\-Traindenotes post\-training \(RL\)\.Avgis the unweighted mean across the 23 evaluation tasks\.PretrainedMid\-TrainPost\-TrainAvgBase 4B47\.8Base 4B✓39\.4Base 4B✓60\.5Base 4B✓✓62\.6Base 8B26\.9Base 8B✓41\.1Base 8B✓63\.8Base 8B✓✓64\.6Instruct 8B48\.3Instruct 8B✓43\.1Instruct 8B✓66\.0Instruct 8B✓✓65\.7#### Data ablation

Table[6](https://arxiv.org/html/2606.14199#A3.T6)shows thatpost\-training contributes the largest gains\. RL alone raises the average score from 47\.8 to 60\.5 for Base 4B, 26\.9 to 63\.8 for Base 8B, and 48\.3 to 66\.0 for Instruct 8B\. Midtraining further improves the post\-trained base models, reaching 62\.6 for Base 4B and 64\.6 for Base 8B\. Overall, these results suggest thatRL is the key step for behavioral alignment, while midtraining provides additional gains when applied before RL on pretrained base models\.

#### RL ablation

Figure[10](https://arxiv.org/html/2606.14199#A3.F10)shows an ablation study on the learning algorithm\. Specifically, we compare our RLVF \(RL with verbal feedback\) method with the following methods: \(1\) GRPO; \(2\) RLVF only\(y1,x\)\(y\_\{1\},x\), a variant that only trains on\(y1,x\)\(y\_\{1\},x\)pairs and removes the loss on\(y0,x\)\(y\_\{0\},x\)and\(y0,x\+h\)\(y\_\{0\},x\+h\)pairs; \(3\) ERL\(Shi et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib45)\), which uses a supervised fine\-tuning objective ony1y\_\{1\}instead of RL; and several SDPO\(Hübotter et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib17)\)variants including \(4\) SDPO\+ token, which combines SDPO with token\-level loss and the GRPO objective; \(5\) SDPO\+ logits, which combines SDPO with logits\-level loss and the GRPO objective; \(6\) SDPO token, i\.e\., SDPO with token\-level loss; and \(7\) SDPO logits, i\.e\., SDPO with logits\-level loss\.

From the results, we observe that RLVF performs better than GRPO on most sub\-metrics, especially on smaller metrics such assecret\. Notably,secretmeasures whether the agent avoids leaking private information—a safety\-critical dimension that is not directly optimized in the scalar RL reward but can be explicitly addressed through verbal feedback\. GRPO reduces all multi\-dimensional scores into a single scalar reward, which loses information and weakens learning signals for improving minor metrics, while RLVF can learn all dimensions effectively through fine\-grained feedback\. Compared to different algorithms that utilize feedback, we observe that reverse\-KL\-based methods such as SDPO collapse on most metrics exceptsecret\. Combining SDPO with GRPO improves performance but still performs worse than RLVF\. ERL also underperforms, likely because the lack of reward normalization makes learning unstable under noisy feedback\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/figures/ablation_reward.png)Figure 10:RL\-algorithm ablation on Sotopia\. Reward over training for our RLVF \(RL with verbal feedback\) against GRPO, ERL\(Shi et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib45)\), and the SDPO\(Hübotter et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib17)\)variants defined in the text\.

## Appendix DPost\-training Data Statistics

#### Construction\.

For everySoul\-Index task we maintain three matched data slices: a held\-out evaluation slice \(capped to 100 instances per task — 500 for HumanLLM — for fast judge\-based scoring\), a per\-task RL training set, and an expert\-distillation \(SFT\) bucket consolidating top\-scoring rollouts\. The RL slice is constructed by sampling 1,024 instances from each task’s training split \(Verl\-style on\-policy RL, batch 64, group size 8 rollouts/prompt, max input/output 8,192 tokens, LoRA rank 32 / alpha 64 on Qwen3\-8B\-Instruct\); when a task’s full training set is smaller than the 1,024 target, we use it in full \(e\.g\.,sotopia\-hard405,social\_r1687,sim\_doc815,sim\_math1,043,lifechoices1,150\)\. A few tasks deviate from the 1,024 default: HUMANUAL splits use 1,000 prompts each \(six domains→\\to6,000 total\),alignxpulls a wider 8,191\-prompt slice across its five sub\-splits, andsocsci210is downsampled to 2,000 from a 2\.4M pool\. We release the post\-training prompts, held\-out evaluation slices, and task\-organized train/test files as the HuggingFace dataset[cmu\-lti/osim\-post\-training](https://huggingface.co/datasets/cmu-lti/osim-post-training)\.

#### Reward signal\.

The RL algorithm is RLVF \(RL with verbal feedback\) for tasks scored by an LLM judge that returns both a scalar reward and a textual critique, and GRPO with the scalar reward only for tasks with verifiable rewards \(full per\-task assignment in[Table7](https://arxiv.org/html/2606.14199#A4.T7)\)\. We usegpt\-5\-nanoas the default judge / environment model, except for CoSER, where the literary\-character rubric is more delicate and we usegpt\-5\.4for improved judge robustness\. For multi\-agent tasks \(CoSER\), the episode\-level reward returned by the judge is assigned to each agent’s rollout so on\-policy advantages remain well\-defined\.

#### Expert distillation\.

After per\-task RL we run rejection\-sampled fine\-tuning \(RFT\) on the experts: for each prompt we draw 8 rollouts, keep the top\-1 by reward, deduplicate, and aggregate into 13 family\-level SFT files \([Table8](https://arxiv.org/html/2606.14199#A4.T8)\); these files form the*expert distillation*corpus used to merge the per\-task experts back into a single deployable𝒪\\mathcal\{O\}sim\-8B\. Family\-level grouping reflects shared\-skill task clusters:tom\_sftconsolidates Fantom/Hitom/Paratomi \(Theory of Mind\),humanual\_sftaggregates all six HUMANUAL domains,simarena\_sftcovers SimArena\-Math/Doc, andsim\_sftcaptures user\-simulation tasks with overlapping prompt distributions\.

Table 7:Per\-task evaluation and RL\-training data statistics\.Evalreports theSoul\-Index held\-out slice we score \(capped to 100 instances per task; HumanLLM uses 500\)\.RL Source,Original, andUsedreport the per\-task RL split, its full size, and the number of prompts we sample for RL training\.Algindicates RLVF \(RL with verbal feedback from an LLM judge\) or GRPO \(scalar verifiable reward\)\.TaskEvalRL SourceOriginalUsedAlgsotopia\-hard100sotopia\_clean\_rl405405RLVFcoser100coser\_rl\_train21,1751,024RLVFlifechoices100lifechoices\_hard\_rl1601,150GRPOuserllm100userllm\_rl\_train28,9181,024RLVFmirrorbench100mirrorbench\_rl\_train3,4001,024RLVFfantom100fantom\_rl\_train2,3661,024RLVFhitom100hitom\_rl\_train6,0001,024GRPOparatomi100paratomi\_rl\_train1,8631,024RLVFmistakes100mistakes\_rl\_train3,4941,024GRPOtwinvoice100twinvoice\_rl\_train—1,024RLVFsocial\_r1100social\_r1\_rl687687GRPObehaviorchain100behaviorchain\_rl\_train5,0001,024GRPOsim\_math100sim\_math\_rl1,0431,043GRPOsim\_doc100sim\_doc\_rl815815GRPOhumanual\-book100humanual\-book34,1701,000GRPOhumanual\-chat100humanual\-chat23,1411,000GRPOhumanual\-email100humanual\-email6,3771,000GRPOhumanual\-news100humanual\-news48,6181,000GRPOhumanual\-opinion100humanual\-opinion37,7911,000GRPOhumanual\-politics100humanual\-politics45,4291,000GRPOalignx\-demo100alignx\_rl\_8k74,8268,191GRPOalignx\-pair100alignx\-ugc100alignx\-arbitrary100alignx\-history16100socsci210100socsci210\_rl\_2k2,418,7482,000GRPOhumanllm500humanllm\_rl\_train185,9121,024GRPOTotal3,100——30,551—Table 8:Expert\-distillation \(SFT\) data: family\-level files derived from rollout\-filtered RFT data \(top\-1\-of\-8 reward filtering, then deduplication\)\. Each file aggregates one or moreSoul\-Index tasks \(e\.g\.,tom\_sftcovers Fantom / Hitom / Paratomi;humanual\_sftcovers all six HUMANUAL domains\)\.SFT FileRowsalignx\_sft7,720behaviorchain\_sft1,994coser\_sft4,096humanllm\_sft3,181humanual\_sft11,969lifechoices\_sft3,912mistakes\_sft3,077sim\_sft9,028simarena\_sft3,716social\_r1\_sft557socsci210\_sft2,000sotopia\_sft1,620tom\_sft5,832Total58,702

## Appendix EMidtraining Dataset Details

### E\.1Dataset Details and Sources

[Table9](https://arxiv.org/html/2606.14199#A5.T9)lists every released dataset by its identifier in the midtraining release[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training), its display name, its canonical paper \(citation key in our bibliography\), and its primary source/download location\. Where no formal publication exists, or where the bibtex entry is not yet in this paper, the*Paper*column is marked “—”\. ConvoKit\-distributed corpora share a single download interface\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\); per\-corpus URLs are listed for direct retrieval\.

Table 9:Source bibliography for the 62 released datasets in[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training), sorted by capability dimension matching[Figure3](https://arxiv.org/html/2606.14199#S3.F3)\. Dataset ids are abbreviated for compactness \(e\.g\.\-corpussuffix dropped on ConvoKit entries;conversations\-gone\-awry→\\toCGA;wiki\-articles\-for\-deletion→\\towiki\-AfD\); the released identifiers in the manifest retain their full names\.*Paper*: canonical citation; “—” indicates either no formal publication or no bibtex entry in this paper\.*Source*: primary download/code location; ConvoKit corpora are downloadable through the unified ConvoKit interface\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)\.\#Dataset idDisplay namePaperSource1wildchatWildChat\(Zhao et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib62)\)[https://huggingface\.co/datasets/allenai/WildChat\-4\.8M](https://huggingface.co/datasets/allenai/WildChat-4.8M)2lmsysLMSYS\-Chat\-1M\(Zheng et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib64)\)[https://huggingface\.co/datasets/lmsys/lmsys\-chat\-1m](https://huggingface.co/datasets/lmsys/lmsys-chat-1m)3oasst1OpenAssistant 1\(Köpf et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib25)\)[https://huggingface\.co/datasets/OpenAssistant/oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)4oasst2OpenAssistant 2\(Köpf et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib25)\)[https://huggingface\.co/datasets/OpenAssistant/oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2)5dailydialogDailyDialog\(Li et al\.,[2017](https://arxiv.org/html/2606.14199#bib.bib30)\)[https://huggingface\.co/datasets/ConvLab/dailydialog](https://huggingface.co/datasets/ConvLab/dailydialog)6cornell\_movieCornell Movie Dialogs\(Danescu\-Niculescu\-Mizil & Lee,[2011](https://arxiv.org/html/2606.14199#bib.bib10)\)[https://www\.cs\.cornell\.edu/~cristian/data/cornell\_movie\_dialogs\_corpus\.zip](https://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip)7convokit\_friendsFriends\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/friends\-corpus](https://convokit.cornell.edu/datasets/friends-corpus)8convokit\_switchboardSwitchboard\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/switchboard\-corpus](https://convokit.cornell.edu/datasets/switchboard-corpus)9convokit\_small\-poolsmall\-pool \(8\-corpus merge\)\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/](https://convokit.cornell.edu/)10convokit\_tennisTennis press\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/tennis\-corpus](https://convokit.cornell.edu/datasets/tennis-corpus)11convokit\_npr\-2pNPR 2\-person\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/npr\-2p\-corpus](https://convokit.cornell.edu/datasets/npr-2p-corpus)12convokit\_mediasumMediaSum\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/mediasum\-corpus](https://convokit.cornell.edu/datasets/mediasum-corpus)13convokit\_parliamentUK Parliament\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/parliament\-corpus](https://convokit.cornell.edu/datasets/parliament-corpus)14convokit\_supremeSCOTUS oral arguments\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/supreme\-corpus](https://convokit.cornell.edu/datasets/supreme-corpus)15convokit\_reddit\-corpus\-smallReddit \(small\)\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/reddit\-corpus\-small](https://convokit.cornell.edu/datasets/reddit-corpus-small)16convokit\_reddit\-coarseReddit Coarse\-Discourse\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/reddit\-coarse\-discourse\-corpus](https://convokit.cornell.edu/datasets/reddit-coarse-discourse-corpus)17convokit\_wikiWikipedia talk\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/wiki\-corpus](https://convokit.cornell.edu/datasets/wiki-corpus)18convokit\_wikiconv\-2018WikiConv\-2018\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/wikiconv\-corpus](https://convokit.cornell.edu/datasets/wikiconv-corpus)19convokit\_wiki\-AfDWikipedia AfD\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/wiki\-articles\-for\-deletion\-corpus](https://convokit.cornell.edu/datasets/wiki-articles-for-deletion-corpus)20convokit\_chromiumChromium code review\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/chromium\-corpus](https://convokit.cornell.edu/datasets/chromium-corpus)21empatheticEmpathetic Dialogues\(Rashkin et al\.,[2019](https://arxiv.org/html/2606.14199#bib.bib39)\)[https://huggingface\.co/datasets/facebook/empathetic\_dialogues](https://huggingface.co/datasets/facebook/empathetic_dialogues)22convokit\_emotional\-supportESConv \(emotional support\)\(Liu et al\.,[2021](https://arxiv.org/html/2606.14199#bib.bib32)\)[https://convokit\.cornell\.edu/datasets/emotional\-support](https://convokit.cornell.edu/datasets/emotional-support)23convokit\_casinoCaSiNo \(camping negotiations\)\(Chawla et al\.,[2021](https://arxiv.org/html/2606.14199#bib.bib6)\)[https://convokit\.cornell\.edu/datasets/casino\-corpus](https://convokit.cornell.edu/datasets/casino-corpus)24convokit\_persuasion4goodPersuasion\-For\-Good\(Wang et al\.,[2019](https://arxiv.org/html/2606.14199#bib.bib54)\)[https://convokit\.cornell\.edu/datasets/persuasionforgood\-corpus](https://convokit.cornell.edu/datasets/persuasionforgood-corpus)25convokit\_winning\-argsChangeMyView Winning Args\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/winning\-args\-corpus](https://convokit.cornell.edu/datasets/winning-args-corpus)26convokit\_CGA\-wikiCGA: Wikipedia talk\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/conversations\-gone\-awry\-corpus](https://convokit.cornell.edu/datasets/conversations-gone-awry-corpus)27convokit\_CGA\-cmvCGA: r/ChangeMyView\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/conversations\-gone\-awry\-cmv\-corpus](https://convokit.cornell.edu/datasets/conversations-gone-awry-cmv-corpus)28convokit\_CGA\-cmv\-largeCGA: CMV large\(Chang et al\.,[2020](https://arxiv.org/html/2606.14199#bib.bib5)\)[https://convokit\.cornell\.edu/datasets/conversations\-gone\-awry\-cmv\-corpus\-large](https://convokit.cornell.edu/datasets/conversations-gone-awry-cmv-corpus-large)29convokit\_IDEA\-NTHU\-tweetsIDEA\-NTHU unintended\-offense tweets—[https://github\.com/IDEA\-NTHU\-Taiwan/unintended\-offense\-tweets](https://github.com/IDEA-NTHU-Taiwan/unintended-offense-tweets)30tom\_fantomFANToM\(Kim et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib21)\)[https://github\.com/skywalker023/fantom](https://github.com/skywalker023/fantom)31tom\_hitomHiToM\(He et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib16)\)[https://github\.com/ying\-hui\-he/Hi\-ToM\_dataset](https://github.com/ying-hui-he/Hi-ToM_dataset)32tom\_paratomiToMi / ParaToMi\(Le et al\.,[2019](https://arxiv.org/html/2606.14199#bib.bib26)\)[https://github\.com/msclar/symbolictom](https://github.com/msclar/symbolictom)33tom\_mindgamesMindGames\(Sileo,[2023](https://arxiv.org/html/2606.14199#bib.bib46)\)[https://huggingface\.co/datasets/sileod/mindgames](https://huggingface.co/datasets/sileod/mindgames)34tom\_socialiqaSocial IQA\(Sap et al\.,[2019](https://arxiv.org/html/2606.14199#bib.bib43)\)[https://huggingface\.co/datasets/allenai/social\_i\_qa](https://huggingface.co/datasets/allenai/social_i_qa)35tom\_moralstoriesMoral Stories\(Emelin et al\.,[2021](https://arxiv.org/html/2606.14199#bib.bib13)\)[https://huggingface\.co/datasets/demelin/moral\_stories](https://huggingface.co/datasets/demelin/moral_stories)36tom\_from\_coserToM\-from\-CoSER \(derived\)\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\)[https://github\.com/Neph0s/CoSER](https://github.com/Neph0s/CoSER)37tom\_tominliTom\-in\-Li \(ToM\-in\-the\-wild\)—*internal \(CMU\-LTI\)*38tom\_grimulkanGrimulkan long\-form RP—[https://huggingface\.co/grimulkan](https://huggingface.co/grimulkan)39tom\_characterllmCharacterLLM→\\toToM—[https://github\.com/choosewhatulike/trainable\-agents](https://github.com/choosewhatulike/trainable-agents)40psych101Psych\-101\(Binz et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib4)\)[https://huggingface\.co/datasets/marcelbinz/Psych\-101](https://huggingface.co/datasets/marcelbinz/Psych-101)41coserCoSER\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\)[https://huggingface\.co/datasets/Neph0s/CoSER](https://huggingface.co/datasets/Neph0s/CoSER)42soc\_cornellCornell Movie \+ social goals\(Danescu\-Niculescu\-Mizil & Lee,[2011](https://arxiv.org/html/2606.14199#bib.bib10)\)[https://www\.cs\.cornell\.edu/~cristian/Cornell\_Movie\-Dialogs\_Corpus\.html](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html)43soc\_haicoHAICosystem\(Zhou et al\.,[2024a](https://arxiv.org/html/2606.14199#bib.bib65)\)*internal \(CMU\-LTI; related: HAICOSYSTEM, COLM 2025\)*44soc\_persona\_conflictsPersona Conflicts—*internal \(CMU\-LTI\)*45soc\_sotopia\_pi\_bcSOTOPIA\-π\\pi\(BC\)—[https://huggingface\.co/datasets/cmu\-lti/sotopia\-pi](https://huggingface.co/datasets/cmu-lti/sotopia-pi)46soc\_sotopia\_tom\_silverSOTOPIA\-ToM \(Silver\)\(Zhou et al\.,[2024b](https://arxiv.org/html/2606.14199#bib.bib66)\)*internal \(CMU\-LTI; built on Sotopia\)*47humanual\_bookHUMANUAL: book\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)[https://aka\.ms/humanllm](https://aka.ms/humanllm)48humanual\_chatHUMANUAL: chat\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)[https://aka\.ms/humanllm](https://aka.ms/humanllm)49humanual\_emailHUMANUAL: email\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)[https://aka\.ms/humanllm](https://aka.ms/humanllm)50humanual\_newsHUMANUAL: news\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)[https://aka\.ms/humanllm](https://aka.ms/humanllm)51humanual\_opinionHUMANUAL: opinion\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)[https://aka\.ms/humanllm](https://aka.ms/humanllm)52humanual\_politicsHUMANUAL: politics\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)[https://aka\.ms/humanllm](https://aka.ms/humanllm)53human\_llmCognitive Genome \(HumanLLM\)\(Lei et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib27)\)*internal release; arXiv:2601\.15793*54alignx\_v2AlignX \(v2 conversational subset\)\(Li et al\.,[2025a](https://arxiv.org/html/2606.14199#bib.bib28)\)[https://huggingface\.co/datasets/JinaLeejnl/AlignX](https://huggingface.co/datasets/JinaLeejnl/AlignX)55mathdialMathDial—[https://github\.com/eth\-nlped/mathdial](https://github.com/eth-nlped/mathdial)56studychatStudyChat—[https://huggingface\.co/datasets/wmcnicho/StudyChat](https://huggingface.co/datasets/wmcnicho/StudyChat)57education\_dialogueEducation Dialogue—[https://github\.com/google\-research\-datasets/Education\-Dialogue\-Dataset](https://github.com/google-research-datasets/Education-Dialogue-Dataset)58hh\_rlhfAnthropic HH\-RLHF\(Bai et al\.,[2022](https://arxiv.org/html/2606.14199#bib.bib2)\)[https://huggingface\.co/datasets/Anthropic/hh\-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf)59nectarNectar\(Zhu et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib69)\)[https://huggingface\.co/datasets/berkeley\-nest/Nectar](https://huggingface.co/datasets/berkeley-nest/Nectar)60rm\_r1\_sftRM\-R1\-Distill SFT—[https://huggingface\.co/datasets/gaotang/RM\-R1\-Distill\-SFT](https://huggingface.co/datasets/gaotang/RM-R1-Distill-SFT)61prismPRISM\(Kirk et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib22)\)[https://huggingface\.co/datasets/HannahRoseKirk/prism\-alignment](https://huggingface.co/datasets/HannahRoseKirk/prism-alignment)62socsci210SocSci210\(Kolluri et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib23)\)[https://huggingface\.co/datasets/socratesft/SocSci210](https://huggingface.co/datasets/socratesft/SocSci210)

### E\.2Source\-to\-Dataset Map

[Figure11](https://arxiv.org/html/2606.14199#A5.F11)maps the data flow at three levels of abstraction\.Tier 1is the broad provenance category \(4 nodes\): human↔\\leftrightarrowhuman conversation/text, human↔\\leftrightarrowAI conversation, preference & behavioral\-response data, and LLM\-generated dialogue & role\-play\.Tier 2is the underlying origin platform or production method \(e\.g\. Reddit, Wikipedia, WildChat, MTurk dyadic role\-play, GPT\-4 self\-play, NSF/TESS social\-science experiments\) — 28 distinct origins in total\.Tier 3is the resulting dataset \(63 datasets insft\_processed\_large/\)\. Flow width is proportional tolog10\\log\_\{10\}of train tokens \(in millions\) per edge, so small datasets \(FANToM, MoralStories, GAP, etc\.\) stay visible alongside the multi\-billion\-token outliers \(alignx\_v2, wildchat, socsci210\)\.

A few datasets carry weight from more than one origin and are split across Tier\-2 nodes accordingly:prismis divided 50/50 between participant\-interview text \(under human↔\\leftrightarrowAI\) and cross\-cultural model\-rating preferences \(under preference data\);tom\_socialiqaandtom\_moralstoriesare divided 50/50 between MTurk\-authored social\-norm short\-form QA \(human\-authored\) and ToM\-training use \(LLM\-generated category\); andhuman\_llm\(Cognitive Genome\) is split evenly across Reddit, Twitter, Blogger, and Amazon since the source paper does not publish per\-platform token counts\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x8.png)Figure 11:Three\-tier provenance map for the𝒪\\mathcal\{O\}dysSimtraining mixture: provenance category \(4\)→\\toorigin platform \(28\)→\\todataset \(63\)\. Color encodes the Tier\-1 provenance category; flow width islog10\\log\_\{10\}of per\-edge train tokens \(in M\)\.

### E\.3Dataset\-to\-Capability Audit

[Table10](https://arxiv.org/html/2606.14199#A5.T10)gives the full 63\-dataset assignment used for[Figure3](https://arxiv.org/html/2606.14199#S3.F3)\. Identifiers are kept identical to the released split manifest and to[Table12](https://arxiv.org/html/2606.14199#A5.T12)\.

Table 10:Complete dataset\-to\-capability map for the 62 released datasets in[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training), sorted alphabetically by released dataset identifier\. Capability codes match[Figure3](https://arxiv.org/html/2606.14199#S3.F3):CONV\(discourse & interaction\),SS\(social skills\),COG\(cognitive / mental\-state\),ROLE\(persona, roleplay & pedagogy\),EVAL\(judgment & preference\)\.DatasetCap\.DatasetCap\.DatasetCap\.alignx\_v2ROLEconvokit\_wikiconv\-2018CONVpsych101COGconvokit\_IDEA\-NTHU\-tweetsSSconvokit\_winning\-argsSSrm\_r1\_sftEVALconvokit\_casinoSScornell\_movieCONVsoc\_cornellROLEconvokit\_chromiumCONVcoserROLEsoc\_haicoROLEconvokit\_CGA\-cmvSSdailydialogCONVsoc\_persona\_conflictsROLEconvokit\_CGA\-cmv\-largeSSeducation\_dialogueROLEsoc\_sotopia\_pi\_bcROLEconvokit\_CGA\-wikiSSempatheticSSsoc\_sotopia\_tom\_silverROLEconvokit\_emotional\-supportSShh\_rlhfEVALsocsci210EVALconvokit\_friendsCONVhuman\_llmROLEstudychatROLEconvokit\_mediasumCONVhumanual\_bookROLEtom\_characterllmCOGconvokit\_npr\-2pCONVhumanual\_chatROLEtom\_fantomCOGconvokit\_parliamentCONVhumanual\_emailROLEtom\_from\_coserCOGconvokit\_persuasion4goodSShumanual\_newsROLEtom\_grimulkanCOGconvokit\_reddit\-coarseCONVhumanual\_opinionROLEtom\_hitomCOGconvokit\_reddit\-corpus\-smallCONVhumanual\_politicsROLEtom\_mindgamesCOGconvokit\_small\-poolCONVlmsysCONVtom\_moralstoriesCOGconvokit\_supremeCONVmathdialROLEtom\_paratomiCOGconvokit\_switchboardCONVnectarEVALtom\_socialiqaCOGconvokit\_tennisCONVoasst1CONVtom\_tominliCOGconvokit\_wiki\-AfDCONVoasst2CONVwildchatCONVconvokit\_wikiCONVprismEVAL
### E\.4Persona Profile Coverage

We probe how widely the≈\\approx19\.7M training\-row system prompts insft\_processed\_largecover the space of plausible character profiles\. The pipeline is intentionally LLM\-free; everything below is regex match plus set lookup against curated lexicons\.

#### Pipeline\.

- •Extract\.From every parquet shard we pull only the system message \+ minimal metadata, producing a 3\.4 GB “system\-prompt” parquet companion tosft\_processed\_large\.
- •Strip context\.Each prompt is split into sentences; any sentence containing a goal pattern \(“Your goal is to …”, “Your aim is to …”, “You’re trying to …”, “You want to …”, etc\.\) is dropped via regex, leaving the residual character description\.
- •Match\.The residual is tokenized \(whole\-word, case\-insensitive\) and matched against five hand\-curated lexicons: - –Occupations\(414 terms; SOC / O\*NET\-derived professional and community roles\) - –Traits\(490 terms; Goldberg/Saucier/John\-Srivastava Big\-Five markers, communication\-style adjectives, emotion words, cognitive\-style terms\) - –Demographics\(358 terms; age, education, marital, family role, gender, race/ethnicity, religion, geography, sexual orientation\) - –Register / voice\(86 terms; formality, voice, register markers\) - –Settings / contexts\(227 terms; physical locations, online platforms, specific events, scene/genre\)
- •Fingerprint\.The sorted union of matched \(occupation, trait, demographic, register\) terms forms aprofile\_fingerprint; settings are kept separately as context\. Two prompts with identical fingerprints are treated as the same intrinsic character regardless of differing topical context\.

#### Headline counts \(full coverage, no sampling\)\.

Across 19,669,019 system prompts the pipeline finds:

- •1,090,417unique profile fingerprints \(vs\. 2\.91M unique full\-prompt strings; goal/context account for≈\\approx62% of the apparent\-uniqueness gap\)\.
- •299distinct occupations matched,460distinct traits,400distinct demographic markers\.
- •63,199distinct \(occupation, trait\) co\-occurrence pairs populated across the top\-25 occupations and top\-25 traits\.

#### Three persona\-uniqueness regimes\.

The 62 datasets split cleanly into three regimes by the ratio of unique system prompts to row count:\(i\) Per\-record unique \(≥\\geq99%\):all 22 ConvoKit\-back\-generated corpora, pluscornell\_movie,coser,mathdial,prism,studychat,education\_dialogue,soc\_persona\_conflicts, and the SOTOPIA\-style sets — each conversation has a bespoke persona generated independently\.\(ii\) Per\-user / templated \(5–65%\):humanual\_book/email/news/opinion/politicsuse a fixed\-persona\-per\-user \(209 distinct customers inhumanual\_book; 8K Medium / YouTube users inhumanual\_news\);dailydialog\(58% unique\) andempathetic\(50% unique\) are templated by emotion \+ situation;wildchat,lmsys,oasst1/2cluster around 58–60% unique\.\(iii\) Boilerplate single persona:rm\_r1\_sft,tom\_characterllm,tom\_moralstories,tom\_socialiqa,tom\_from\_coser,tom\_mindgames,tom\_grimulkan,tom\_tominli, andhumanual\_chateach carry a single fixed system prompt across thousands of rows — once any other dataset has matched a similar profile, these contribute zero new fingerprints\.

#### SOC coverage\.

We map the matched\-occupation set onto the 23 major groups of the BLS Standard Occupational Classification \(2018\)\. Each group is represented by 6–18 indicator titles drawn from O\*NET; an indicator is “covered” if its term appears at least once in the corpus’s matched occupations\.[Figure13](https://arxiv.org/html/2606.14199#A5.F13)reports per\-group coverage\. All 23 groups have at least one matched indicator\. Coverage is most complete on*Legal*,*Healthcare Practitioners*,*Arts / Design / Entertainment / Sports / Media*,*Education*, and*Computer & Math*roles \(90–100%\); it drops on manual\-trade and personal\-care occupations \(*Personal Care and Service*18%,*Healthcare Support*29%,*Production*38%,*Architecture & Engineering*38%\) wheresft\_processed\_largeskews away by construction — the corpus is built around conversational personas, not skilled\-trade workers\.

#### Co\-occurrence breadth across personality and demographic axes\.

[Figure12](https://arxiv.org/html/2606.14199#A5.F12)reports a two\-panel co\-occurrence view of the corpus along three persona axes simultaneously:*\(left\)*20 SOC\-aligned representative occupations×\\times22 traits organized by Big\-Five dimension \(Openness / Conscientiousness / Extraversion / Agreeableness / Neuroticism\) plus communication style;*\(right\)*the same 20 occupations×\\times18 demographic markers grouped by gender / age / marital / education / race\-ethnicity / family role\. Rows are SOC\-aligned \(one to two representative occupations per major BLS group\) rather than raw most\-frequent terms, so the row ordering is interpretable as a sweep across the labor market\. Columns on the left panel cover both poles of each Big\-Five factor \(e\.g\.,*introverted*vs*outgoing*,*anxious*vs*calm*,*traditional*vs*open\-minded*\)\. Cell counts are the number of system prompts whose residual character description matches*both*the row and the column term; color is log\-scaled \(magma\_r\)\. The matrix is densely populated — 437/440 trait cells and 333/360 demographic cells carry non\-zero co\-occurrence\. Dominant cells includeadvocate×\\timespassionate\(105K\),justice×\\timespassionate\(102K\),student×\\timesyoung\(241K\),justice×\\timesyoung\(116K\),advocate×\\timesyoung\(90K\), andstudent×\\times\{college,graduate\} \(63K each\), reflecting the dominant socsci210 / advocacy / student populations\.

#### Per\-dimension coverage\.

[Figure14](https://arxiv.org/html/2606.14199#A5.F14)ranks the top\-25 most frequent terms within each profile dimension \(occupations, traits, demographics, register\), color\-stacked by Tier\-1 origin\. The figure highlights both the*volume*contributed by each Tier\-1 \(almost everything is dominated by human↔\\leftrightarrowhuman content, with socsci210 the second\-largest contributor in pink\) and the*breadth*of distinct values matched in each dimension \(panel titles\)\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x9.png)Figure 12:Persona\-profile coverage along three axes simultaneously across the 19\.7M system prompts ofsft\_processed\_large\. Rows: 20 representative occupations chosen one to two per BLS SOC 2018 major group \(management→\\tobusiness / sciences→\\tolegal→\\toeducation→\\toarts/media→\\tohealthcare→\\toservices→\\totransportation\)\.*Left panel:*22 traits organized by Big\-Five factor \(Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism\) plus communication style, with both poles represented \(e\.g\.,*introverted*vs\.*outgoing*;*anxious*vs\.*calm*;*traditional*vs\.*open\-minded*\)\.*Right panel:*18 demographic markers grouped by gender / age / marital / education / race\-ethnicity / family\. Cell value==\# system prompts that match*both*the row term and the column term; color is log\-scaled \(magma\_r\)\. 437/440 \(left\) and 333/360 \(right\) cells carry non\-zero co\-occurrence\.![Refer to caption](https://arxiv.org/html/2606.14199v1/x10.png)Figure 13:Coverage of the 23 SOC 2018 major occupational groups bysft\_processed\_large\. For each group we draw 6–18 indicator titles from O\*NET; the bar shows the fraction of those titles that appear in the corpus’s matched\-occupation set\. All 23 groups have≥\\geq1 match\.![Refer to caption](https://arxiv.org/html/2606.14199v1/x11.png)Figure 14:Top\-25 most frequent terms in each profile dimension \(roles / traits / demographics / register\) acrosssft\_processed\_large\. Bars are stacked by Tier\-1 origin \(blue: human\-human; orange: human\-AI; pink: preference / behavioral; violet: LLM\-generated\); panel titles report the total \# distinct values matched corpus\-wide\.

### E\.5Data Processing

All sources are converted into a unified schema with three fields: \(1\)user\_id: anonymized user identifier; \(2\)conversations: a list of multi\-turn dialogues, each containing timestamped messages with role and content; \(3\)user\_meta: demographic and contextual information where available\. Each dataset is processed through a single pipeline that encodes per\-corpus role assignment, turn linearization, and state labels\.

#### Synthesizing missing social context\.

Many of the source datasets arrive without an explicit system prompt — open\-domain AI chat logs \(WildChat, LMSYS\-Chat\-1M, OASST\), preference data \(HH\-RLHF, Nectar\), and most ConvoKit corpora carry only the raw turns\. For these we synthesize a per\-record system\-prompt header usinggpt\-5\.4\-mini\-2026\-03\-17\. Per conversation we generate*two*prose system prompts \(one per side, so the same record yields two training rows — original and role\-swapped, see below\), each 2–5 sentences starting with “You are…” and describing the simulated party’s role, goal, background, and conversational style\. A deterministic per\-side mode selector \(sha256 ofrecord\_id::side, threshold0\.200\.20\) sends∼\\sim20% of sides to a*detailed*mode \(fuller backstory\) and∼\\sim80% to a*short*mode \(style\-only\); same record always yields the same modes across re\-runs so the corpus is exactly reproducible\. To prevent leakage, the generator sees only the first 60% of turns \(≥3\\geq 3turns minimum\) and never sees the conversation’s outcome metadata \(statefield, escalation flags, downstream labels\), so the synthesized persona cannot foreshadow the trajectory\. Datasets that natively carry persona or scenario information — CoSER \(literary characters\), SOTOPIA / HAICosystem / Persona\-Conflicts \(goal\-conditioned scenarios\), Cognitive Genome and AlignX \(persona\-grounded simulation\), and the six HUMANUAL domains \(per\-user backgrounds\) — retain their original system prompts without modification\.

#### Role swap and filtering\.

For human and AI conversation sources, we apply a*role\-swap*protocol: the user side of each conversation becomes the training target, with the assistant responses serving as context\. This trains the model to produce human\-like user behavior rather than assistant\-like responses\. Filtering removes conversations with fewer than 2 turns, deduplicates at the conversation level using MinHash, and filters toxic or personally identifiable content\.

### E\.6Train / Val / Test Split

The 21\.4M\-row corpus is partitioned into three splits with deliberately different distributional properties \([Table12](https://arxiv.org/html/2606.14199#A5.T12)\):*train*\(21,195,418 rows\),*val*\(28,408 rows, in\-distribution\), and*test*\(128,045 rows, profile\-disjoint where feasible\)\. All three splits are released as the HuggingFace dataset[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training), with the per\-dataset manifest\_split\_manifest\.jsonstored alongside the parquet shards\. The split is fully deterministic from the rules below: anyone can reproduce it bit\-exact by running our published splitting script on the released corpus\.

#### Profile fingerprint, the unit we hold out\.

For each system prompt we strip goal sentences via regex \(e\.g\., “Your goal is to …”, “You’re trying to …”\) and match the residual against five hand\-curated lexicons \(occupations, personality / communication\-style traits, demographics, register, settings;[SectionE\.4](https://arxiv.org/html/2606.14199#A5.SS4)\)\. The sorted union of matched \(occupation, trait, demographic, register\) terms forms a*profile fingerprint*; settings are recorded separately as context\. Two prompts with identical fingerprints describe the same intrinsic character regardless of differing topical framing\.

#### Test split \(out\-of\-distribution, profile\-disjoint\)\.

For each datasetdd, withndn\_\{d\}rows andudu\_\{d\}unique profile fingerprints:

```
target_test = max(100, floor(0.005 * n_rows))
if d in {humanual_book: 15, socsci210: 30, dailydialog: 30}:
    # deterministic-N-profile override (pre-baked in split_index.json)
    pick N profiles with lowest sha256(fingerprint); all their rows -> test
elif u_d >= 50:
    # profile_hash mode: 47 datasets
    bucket_pct = target_test / n_rows
    for each row r:
        key   = profile_fingerprint(r)
        score = int(sha256(key)[:8], 16) / 2^32
        r -> test iff score < bucket_pct
else:
    # record_hash mode: 16 datasets (boilerplate / few-profile / no-system-prompt)
    bucket_pct = target_test / n_rows
    for each row r:
        key   = f"{ds_name}::{shard_filename}::{row_idx}"
        score = int(sha256(key)[:8], 16) / 2^32
        r -> test iff score < bucket_pct
```

For the 47 profile\-hash datasets \(117,568 of 128,045 test rows, 91\.8%\), every profile fingerprint intest\_shard\_\*\.parquetis*disjoint*from every fingerprint intrain\_shard\_\*\.parquet, by construction of the per\-row hash\. Train holds*1,035,409*unique profiles, test holds*6,281*, with only 115 overlapping \(all attributable to the 16 record\-hash datasets\); 99\.98% of distinct test profiles are unseen in train\.

Three datasets received deterministic\-N\-profile overrides because the default 0\.5%\-bucket rule gave too few or zero test rows under hash variance \(humanual\_book: 0 rows;dailydialog: 14 rows;socsci210: 4,726 rows of a 13K target\)\. The remaining 16 record\-hash datasets share profiles between train and test — their profile spaces are too small to support disjoint splitting \(single\-template ToM datasets,rm\_r1\_sft,psych101,humanual\_chat; plus 6 datasets where the system message is empty or absent so no fingerprint can be extracted:hh\_rlhf,human\_llm,nectar,tom\_fantom,tom\_hitom,tom\_paratomi\)\. For those, the test holdout is row\-level, not profile\-level\.

#### Val split \(in\-distribution, sampled from train\)\.

After test rows are removed,nval,d=max⁡\(30,min⁡\(5000,⌊0\.005⋅ntrain,d⌋\)\)n\_\{\\text\{val,d\}\}=\\max\(30,\\min\(5000,\\lfloor 0\.005\\cdot n\_\{\\text\{train,d\}\}\\rfloor\)\)rows per dataset are drawn uniformly at random from the remaining train shards using a deterministic seedsha256\(ds\_name\)\[:8\]\. The 5,000\-row cap prevents the largest datasets \(alignx\_v2at 14\.7M rows;socsci210at 2\.6M\) from dominating the val set\. Every val row’s profile fingerprint also appears in train; this is intentional — val measures in\-distribution loss, not generalization\.

#### What each split measures\.

- •Val\(in\-distribution, 28K rows\): checkpoint selection, learning\-rate scheduling, detecting train\-distribution overfit\. Loss on val tracks closely with train loss\.
- •Test\(out\-of\-distribution, 128K rows\): detecting whether the model has acquired a behavioral capability or merely memorized \(profile, behavior\) mappings seen in training\. A growing gap between val and test loss would indicate persona memorization rather than capability acquisition\.

We report internal loss curves on both at midtraining time but final benchmark claims in the paper come from the external evaluation suite \([Section3](https://arxiv.org/html/2606.14199#S3)\), which is held out at the source\-dataset level\.

#### Axis\-aligned training data for everySoul\-Index task\.

For everySoul\-Index task we curate training data targeting the same Axis\. The mapping splits into two regimes: \(a\)*Direct*— the benchmark itself ships with a training split, and we use it as\-is, enforcing example\-disjointness against the test rows by the hashing rules above; and \(b\)*Proxy*— the benchmark provides only an evaluation split, and we substitute closely related data from the same benchmark family or behavioral domain as a proxy training source\.[Table11](https://arxiv.org/html/2606.14199#A5.T11)gives the per\-task mapping with row counts;[Table12](https://arxiv.org/html/2606.14199#A5.T12)gives the per\-dataset split breakdown; the publishedsplit\_index\.jsonfixes every bucket boundary so the assignments are bit\-exact reproducible\.

Table 11:Per\-Soul\-Index\-task training\-data mapping\.*Direct*= the benchmark’s own training split \(with our hash\-based test holdout\)\.*Proxy*= curated from a closely related source when the benchmark has no native training set\. Row counts are train rows in[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training)\.AxisSoul\-Index taskTraining sourceTypeRowsCONVUserLLMWildChat / PRISM user\-side splitsDirect—MirrorBenchMulti\-turn AI\-dialogue subsetProxy—Humanual\-Chathumanual\_chatDirect23,385SimArena\-DocSimulatorArena \(Doc\) train splitDirect—SSSotopia\-HardSOTOPIA\-π\\piscenariosProxy2,363COGFantomtom\_fantomtrain splitDirect894Hitomtom\_hitomtrain splitDirect899Paratomitom\_paratomitrain splitDirect—Social\-R1ToM pool \(tom\_\*\)Proxy—ROLECoserCoSER train splitDirect114,831LifechoicesLifeChoices train splitDirect—TwinvoiceCognitive Genome persona dialogueProxy—BehaviorChainOnline\-shopping persona\-action tracesProxy—SimArena\-MathSimulatorArena \(Math\) train splitDirect—MistakesEduDialogue / StudyChat / MathDialProxy29,995Humanual\-Emailhumanual\_emailDirect6,322Humanual\-Newshumanual\_newsDirect49,148Humanual\-Politicshumanual\_politicsDirect45,395EVALAlignXalignx\_v2Direct14\.7MHumanLLMCognitive Genome persona subsetDirect—SocSci210socsci210Direct2,618,745Humanual\-Bookhumanual\_bookDirect31,931Humanual\-Opinionhumanual\_opinionDirect38,613Aux\. preference / reward data \(no Index task\)HH\-RLHF, Nectar, PRISM, RM\-R1\-SFT——
#### Aggregating per\-dataset loss\.

Because per\-dataset test set sizes vary by 3 orders of magnitude \(alignx\_v2 = 80,622 rows; the smallest = 42 rows\), a row\-pooled average would let a handful of big datasets dominate the per\-skill number\. We instead use thegeometric mean of per\-dataset perplexitiesas our per\-skill aggregate, which weights each dataset equally:

PPLdim=geomeand∈Ddim\(PPLd\)=exp⁡\(1\|Ddim\|∑d∈DdimNLLd\),\\mathrm\{PPL\}\_\{\\text\{dim\}\}\\;=\\;\\mathrm\{geomean\}\_\{d\\in D\_\{\\text\{dim\}\}\}\\\!\\big\(\\mathrm\{PPL\}\_\{d\}\\big\)\\;=\\;\\exp\\\!\\Big\(\\tfrac\{1\}\{\|D\_\{\\text\{dim\}\}\|\}\\\!\\sum\_\{d\\in D\_\{\\text\{dim\}\}\}\\mathrm\{NLL\}\_\{d\}\\Big\),whereNLLd\\mathrm\{NLL\}\_\{d\}is the per\-token mean cross\-entropy on datasetdd’s held\-out test split\. This is the convention used in[Table1](https://arxiv.org/html/2606.14199#S3.T1),[Figure15](https://arxiv.org/html/2606.14199#A8.F15), and[Figure5](https://arxiv.org/html/2606.14199#S4.F5)\.

#### Reproducing the split\.

The full pipeline is published in our code repository\. To reproduce bit\-exact:

1. 1\.Download[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training)and runscripts/precompute\_split\_index\.pyto compute fingerprints \+ per\-dataset bucket boundaries \(output:split\_index\.json,∼\\sim0\.65 MB\)\.
2. 2\.Apply viascripts/split\_apply\_index\.py, which streams each parquet shard, looks up bucket assignments, and writestrain\_shard\_NNN\.parquet,val\_shard\_000\.parquet,test\_shard\_NNN\.parquetper dataset\.
3. 3\.The deterministic\-N\-profile overrides for the three flagged datasets are pre\-baked intosplit\_index\.json\.

Or skip regeneration by loading the released split directly:

```
load_dataset("cmu-lti/osim-mid-training", "<config>", split="train")
```

withsplit="train","val", or"test"\.

Table 12:Per\-dataset train / val / test split of[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training), sorted by total rows\.PtrP\_\{\\text\{tr\}\}andPteP\_\{\\text\{te\}\}are the number of unique profile fingerprints in train and test respectively\. For the 46 profile\-disjoint datasets these sets are non\-overlapping by construction\.Shaded rowsare the 16record\_hashdatasets where the test set holds out*rows*rather than profiles, soPte⊆PtrP\_\{\\text\{te\}\}\\subseteq P\_\{\\text\{tr\}\}\(test profiles are a subset of train profiles, not disjoint\)\. “—” indicates no profile signal \(e\.g\., the dataset’s system prompt is empty or absent:hh\_rlhf,human\_llm,nectar,tom\_fantom,tom\_hitom,tom\_paratomi\)\.\#DatasetTrainValTestPtrP\_\{\\text\{tr\}\}PteP\_\{\\text\{te\}\}\#DatasetTrainValTestPtrP\_\{\\text\{tr\}\}PteP\_\{\\text\{te\}\}1alignx\_v214,649,1705,00080,622501,8082,56933oasst113,44767613,303252socsci2102,618,7455,00021,0531,7883034convokit\_tennis\-corpus12,75164616,442453human\_llm1,316,8295,0006,757——35convokit\_friends\-corpus11,210561105,479474convokit\_wiki\-AfD580,7522,9181,186114,99656936tom\_mindgames11,0335586115convokit\_wikiconv\-2018231,8291,1642,63565,10732037convokit\_parliament\-corpus9,44947563,686396nectar180,778908932——38rm\_r1\_sft8,59143119117hh\_rlhf166,878838860——39convokit\_reddit\-corpus\-small8,37542674,736588wildchat165,3468301,26035,11818440prism7,90839614,248489cornell\_movie163,9278231,12475,00338741convokit\_reddit\-coarse\-disc\.6,91234663,9715910coser114,83157765865,44336942humanual\_email6,32231154387711lmsys79,9294011701,5901043convokit\_CGA\-cmv5,96830723,3385212tom\_from\_coser76,7473854161144tom\_tominli5,87230921113convokit\_chromium\-corpus70,36535316020,89711545convokit\_emotional\-support5,061301093,3717314convokit\_wiki\-corpus59,91230120524,59912146convokit\_CGA\-wiki4,91630422,2764015psych10157,6532892899747convokit\_small\-pool4,62030822,8815516humanual\_news49,1482461958,0813448convokit\_switchboard\-corpus4,487301033,2116617empathetic45,871230423,3911749convokit\_casino\-corpus3,99230981,9074918humanual\_politics45,3952282955,2302450convokit\_winning\-args\-corpus3,99530892,5266819convokit\_mediasum\-corpus38,1391911,64620,02510351convokit\_persuasion4good3,887301512,5307020humanual\_opinion38,6131941614,4852052soc\_haico3,37430965071521convokit\_npr\-2p\-corpus37,53218877222,19913653soc\_persona\_conflicts3,264301023,29310222humanual\_book31,9311602,5711661554soc\_cornell2,87630942,7568423tom\_socialiqa33,0671661771155mathdial2,77130605851824education\_dialogue28,02614013315,8039356studychat1,969302157043925dailydialog24,4481221261,4153057soc\_sotopia\_tom\_silver1,296301158958326tom\_moralstories23,7541191271158soc\_sotopia\_pi\_bc1,06730978958127humanual\_chat23,3851171201159tom\_fantom89430100——28oasst220,224101564,9222960tom\_hitom8993095——29convokit\_CGA\-cmv\-large17,304861068,8786461tom\_paratomi9033091——30convokit\_IDEA\-NTHU\-tweets15,91079758,3214262tom\_grimulkan404301051131convokit\_supreme\-corpus15,34077556,1984532tom\_characterllm13,8386911111Total rows: train=21,194,129, val=28,378, test=127,944TotalPtrP\_\{\\text\{tr\}\}=1,035,067,PteP\_\{\\text\{te\}\}=6,253 \(overlap=115; corpus\-unique=1,041,205\)

## Appendix FSoul\-Index Task Details

[Table13](https://arxiv.org/html/2606.14199#A6.T13)lists everySoul\-Index task with its parentSoulAxis, format, and metric\. Per\-task descriptions follow\.

Table 13:TheSoul\-Index evaluation suite: 23 tasks across the 5SoulAxes\. Each task targets one Axis\.Format: D = discriminative, G = generative\.SoulAxisTaskFormatMetricCONVUserLLM\(Naous et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib34)\)G \(Single\-turn\)AccuracyMirrorBench\(Hathidara et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib15)\)G \(Multi\-turn\)Diversity \+ LLMHumanual\-Chat\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)GLLM JudgeSimArena\-Doc\(Dou et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib11)\)G \(Multi\-turn\)Human align\.SSSotopia\-Hard\(Zhou et al\.,[2024b](https://arxiv.org/html/2606.14199#bib.bib66)\)G \(Multi\-turn\)LLM JudgeCOGFantom\(Kim et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib21)\)D \(MCQ \+ Open\)AccuracyHitom\(He et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib16)\)D \(MCQ\)AccuracyParatomi\(Sclar et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib44)\)D \(QA\)AccuracySocial\-R1\(Wu et al\.,[2026a](https://arxiv.org/html/2606.14199#bib.bib55)\)D \(MCQ\)AccuracyROLECoser\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\)G \(Multi\-turn\)LLM JudgeLifechoices\(Xu et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib58)\)D \(MCQ\)AccuracyTwinvoice\(Du et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib12)\)D \(Binary\)AccuracyBehaviorChainD \(MCQ\)AccuracySimArena\-Math\(Dou et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib11)\)G \(Multi\-turn\)Human align\.Mistakes\(Ross & Andreas,[2025b](https://arxiv.org/html/2606.14199#bib.bib41)\)D \(MCQ\)AccuracyHumanual\-Email\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)GLLM JudgeHumanual\-News\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)GLLM JudgeHumanual\-Politics\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)GLLM JudgeEVALAlignX\(Li et al\.,[2025a](https://arxiv.org/html/2606.14199#bib.bib28)\)D \(Pref\.\)AccuracyHumanLLM\(Lei et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib27)\)D \(Pref\.\)AccuracySocSci210\(Kolluri et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib23)\)D \(Rating\)CorrelationHumanual\-Book\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)GLLM JudgeHumanual\-Opinion\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)GLLM Judge#### CONV\(4 tasks\)\.

Tasks in this Axis test whether models reproduce fine\-grained dimensions of everyday human discourse — register, turn\-taking, conversational style, and online help\-seeking\. UserLLM\(Naous et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib34)\)evaluates single\-turn user message generation againstWildChatand PRISM references\. MirrorBench\(Hathidara et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib15)\)measures whether simulated user utterances match the lexical diversity and style of real users in multi\-turn interaction\. Humanual\-Chat is the*chat*domain of HUMANUAL\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\), evaluating fidelity to real chat\-conversation traces\. SimArena\-Doc\(Dou et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib11)\)provides annotated human–LLM conversations in document creation, testing whether simulated users match real user behavior in multi\-turn task assistance\.

#### SS\(1 task\)\.

Sotopia\-Hard\(Zhou et al\.,[2024b](https://arxiv.org/html/2606.14199#bib.bib66)\)places two agents in scenarios requiring negotiation, collaboration, or conflict \(e\.g\., a landlord and tenant negotiating rent\), scored across seven social dimensions \(goal completion, relationship, knowledge, secret leakage, social rules, financial benefits, believability\)\.

#### COG\(4 tasks\)\.

This Axis tests reasoning*about*mental states\. Fantom\(Kim et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib21)\)requires tracking who said what in multi\-party conversations\. Hitom\(He et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib16)\)requires nested\-belief reasoning \(e\.g\., “Alice thinks Bob thinks…”\)\. Paratomi\(Sclar et al\.,[2023](https://arxiv.org/html/2606.14199#bib.bib44)\)is a paraphrase\-robust reformulation of false\-belief reasoning that resists surface\-form shortcuts\. Social\-R1\(Wu et al\.,[2026a](https://arxiv.org/html/2606.14199#bib.bib55)\)is an adversarial benchmark that exposes shortcut reasoning in social cognition, requiring multi\-step inference\.

#### ROLE\(9 tasks\)\.

This Axis tests sustaining a stable role across an interaction — whether the role is a literary character, a persona with specific traits, a student, or a real human user — and translating that role into faithful next\-turn behavior\. Coser\(Wang et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib53)\)requires sustaining a literary character’s personality across 20 turns of dialogue\. Lifechoices\(Xu et al\.,[2024](https://arxiv.org/html/2606.14199#bib.bib58)\)tests whether models make decisions*as*specific characters would\. Twinvoice\(Du et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib12)\)tests whether a model can identify which response matches a specific individual’s communication style\. BehaviorChain tests whether models can predict a specific persona’s next action in a sequenced behavioral chain\. SimArena\-Math\(Dou et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib11)\)is the math\-tutoring counterpart of SimArena\-Doc, with annotated student–tutor traces\. Mistakes\(Ross & Andreas,[2025b](https://arxiv.org/html/2606.14199#bib.bib41)\)tests whether models can faithfully reproduce common student errors in K\-12 math rather than defaulting to correct answers\. The remaining three tasks are HUMANUAL\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)long\-form domains kept on the role\-play side — Humanual\-Email, Humanual\-News, Humanual\-Politics — each scoring whether the model’s continuation matches real human\-authored text in that genre\.

#### EVAL\(5 tasks\)\.

This Axis tests whether models can*evaluate*like humans — a capability central to reward modeling and LLM\-as\-judge pipelines, and broadened here to include long\-form persona\-grounded judgment\. AlignX\(Li et al\.,[2025a](https://arxiv.org/html/2606.14199#bib.bib28)\)measures alignment of model preferences with crowd\-sourced human preferences across multiple sub\-domains\. HumanLLM\(Lei et al\.,[2026](https://arxiv.org/html/2606.14199#bib.bib27)\)tests whether models, given a persona profile, predict the same preferences and judgments as the matching real participant\. SocSci210\(Kolluri et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib23)\)provides participant ratings and social\-science judgments: given a character profile \(e\.g\., “you are a black woman”\) and a survey question \(e\.g\., “On a scale from 1 to 7, how willing would you be to have a partner of the opposite political party?”\), the model must produce a rating that correlates with the matching human respondent’s\. Humanual\-Book and Humanual\-Opinion\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)are the long\-form HUMANUAL domains where the task is judgment\-like rather than character\-driven \(book reviews and opinion writing call for evaluative reasoning\), so we score them under the evaluation\-as\-humans Axis\.

## Appendix GTraining Details

This section gives the full hyperparameter, mixture, and hardware detail elided from[Section4](https://arxiv.org/html/2606.14199#S4)\.

### G\.1Midtraining hyperparameters

#### Optimizer and schedule\.

All midtraining runs \(𝒪\\mathcal\{O\}sim\-4B\-Mid,𝒪\\mathcal\{O\}sim\-4B\-I\-Mid, the\+ Stepcontrols, and the mix\-ratio screens\) use AdamW withβ1=0\.9\\beta\_\{1\}=0\.9,β2=0\.999\\beta\_\{2\}=0\.999, weight decay0\.010\.01, peak learning rateηpeak=1×10−5\\eta\_\{\\text\{peak\}\}=1\\\!\\times\\\!10^\{\-5\}with linear warm\-up over the first 20 optimizer steps and a constant schedule thereafter, gradient\-norm clipping at1\.01\.0, and mixed\-precisionbfloat16\. The context is16,38416\{,\}384input tokens /8,1928\{,\}192response tokens\.

#### Batch and parallelism\.

The mini\-batch is1,0241\{,\}024conversations per optimizer step\. We shard with FSDP\-2, applying full parameter / gradient / optimizer\-state sharding across88H100\-80GB GPUs \(one node\), with dynamic token\-batching at≤49,152\\leq 49\{,\}152tokens per GPU\.

#### Token budget\.

The default token budget is1010B training tokens, corresponding to4,5004\{,\}500optimizer steps at the above batch size\. In the current midtraining analysis, compute efficiency is summarized from intermediate checkpoints of the existing runs rather than from a separate token\-budget grid\.

#### Dataset mixture and upsampling\.

The default mixture is100%100\\%behavioral data drawn from the𝒪\\mathcal\{O\}dysSimcorpus, with per\-dataset upsampling factors that range from0\.03×0\.03\\timesfor the largest source \(alignx\_v2,14\.714\.7M rows\) to5\.37×5\.37\\timesfor the smallest single\-source shard \(tom\_grimulkan,539539rows\)\. The mix\-ratio screen varies𝒪\\mathcal\{O\}dysSim:StepFuntoken ratios of0:1000\{:\}100,50:5050\{:\}50,70:3070\{:\}30,90:1090\{:\}10, and100:0100\{:\}0under the same optimizer, token budget, and step count\.

#### Generic\-mix baseline \(\+ Step\)\.

The\+ Stepbaseline reuses the recipe above verbatim, swapping the behavioral mixture for the publicly released Step\-3\.5\-Flash SFT corpus\(StepFun AI,[2025](https://arxiv.org/html/2606.14199#bib.bib48)\)—a general\-purpose chat/instruct mix \(Q&A, multi\-turn dialogue, tool\-use\) that is generic but not behavior\-irrelevant\. This isolates the marginal contribution of𝒪\\mathcal\{O\}dysSim’s persona\-, roleplay\-, and ToM\-conditioned content beyond what broad chat/instruction midtraining already provides\.

#### Excluded baselines\.

Two classes of model are deliberately excluded from the per\-skill midtraining\-stage diagnostic in[Table1](https://arxiv.org/html/2606.14199#S3.T1):*\(a\)*post\-trained alignment baselines \(HER\-32B,Sotopia\-RL\-7B, GPT\-5\.5, GPT\-5\-nano\), which conflate midtraining with subsequent RLHF / preference alignment and so are unsuitable for diagnosing the midtraining stage in isolation, and*\(b\)*thinking models \(HumanLM\-Opinion\-8B, HER\-32B\-thinking\), whose generation streams contain interleaved thinking\-token preambles that contaminate token\-level loss and surface\-form overlap metrics\. Both classes appear instead in the evaluation suite in[Section5\.2](https://arxiv.org/html/2606.14199#S5.SS2), where the metric of interest is prompt\-conditioned generation under the model’s natural inference setting and these confounds are immaterial\.

### G\.2Post\-training hyperparameters

RL experts \(GRPO, RLVF\) are initialized from Qwen3\-8B\-VL\-Instruct \([Section5](https://arxiv.org/html/2606.14199#S5)\), while the distillation SFT is applied on top of the midtrained checkpoint; all stages reuse the same FSDP\-2 /bfloat16setup as midtraining, and the canonical method definitions are in[Section5](https://arxiv.org/html/2606.14199#S5)\. DPO usesβ=0\.1\\beta=0\.1on the held\-out preference subsplit of the corpus \([Table12](https://arxiv.org/html/2606.14199#A5.T12),record\_hashdatasets\)\. GRPO and RLVF sample88rollouts per prompt at temperature1\.01\.0, with no KL loss, asymmetric \(dual\-clip\) PPO clip ratios of0\.20\.2\(low\) and0\.280\.28\(high\), peak learning rate5×10−65\\\!\\times\\\!10^\{\-6\}with LoRA rank3232, and a batch of6464prompts×8\\times\\,8rollouts per step processed in PPO mini\-batches of1616prompts \(matching the released training script\)\. RLVF additionally conditions a second\-stage GRPO pass on verbal feedback by prepending the feedback as a leading turn \([Section5](https://arxiv.org/html/2606.14199#S5)\)\.

### G\.3Inference for evaluation

Per\-skill PPL is computed teacher\-forced on the held\-out evaluation split of role\-swapped human turns \([Table12](https://arxiv.org/html/2606.14199#A5.T12)\)\. BLEU is computed on free generation with greedy decoding under the same prompts\. Generative evaluation tasks use temperature0\.70\.7via vLLM\. Cross\-tokenizer rows in[Table1](https://arxiv.org/html/2606.14199#S3.T1)report BLEU only; PPL columns are populated only for models in the Qwen3 tokenizer family\.

## Appendix HAdditional Results

### H\.1Midtraining Recipe: Token Scaling and𝒪\\mathcal\{O\}dysSim:Step Mix

This section gives the full recipe investigation summarized in one sentence in[Section4](https://arxiv.org/html/2606.14199#S4.SS0.SSS0.Px2): token\-scaling curves at two model scales, and the𝒪\\mathcal\{O\}dysSim:Step mix\-ratio sweep with its \(behavioral, generic\-instruction\) Pareto frontier\.

#### Token\-scaling curves\.

[Figure15](https://arxiv.org/html/2606.14199#A8.F15)plots geomean validation PPL against midtraining tokens for the𝒪\\mathcal\{O\}dysSimrecipe at two scales \(4B, 8B\) and two backbones \(text\-only Qwen3\-Base, vision\-language Qwen3\-VL\-Instruct\), together with three𝒪\\mathcal\{O\}dysSim:Step mix ratios at 4B \(90:10, 70:30, 50:50\)\. Aggregation is the geometric mean over all 63 held\-out evaluation datasets\.333Caveat: 63 vs\. 62 datasets\.All midtraining numbers in this paper \([Table1](https://arxiv.org/html/2606.14199#S3.T1),[Figure15](https://arxiv.org/html/2606.14199#A8.F15),[Figure16](https://arxiv.org/html/2606.14199#A8.F16)\) are computed on av2eval split of 63 datasets, whereas the released corpus[cmu\-lti/osim\-mid\-training](https://huggingface.co/datasets/cmu-lti/osim-mid-training)contains 62 sources\. The single dataset that differs istom\_sotopia\(1,289 train / 101 test rows\): it is a relabel of self\-rejection\-sampled rollouts from an earlier project checkpoint and was intended for post\-training only, but was accidentally folded into thev2midtraining/eval split that produced all numbers in this paper\. We removed it from the released corpus for transparency but did not re\-run midtraining and evaluation, because the dataset is one of 63 in a geometric\-mean aggregation \(its inclusion does not change recipe rankings or qualitative conclusions\) and the compute budget did not justify it\. Reproducing eval against the released split should produce numerical drift well below the cross\-recipe gaps reported here\.We observe three patterns\.*\(i\) Token scaling is monotone but slow\.*Each curve decreases smoothly with tokens; from∼\\sim250M to∼\\sim4B tokens every Axis improves by another 4–6% PPL, roughly uniform acrossCONV,SS,COG,ROLE, andEVAL\. The diminishing\-returns shape is consistent with prior reports that response\-imitation SFT has a ceiling in human\-behavior fit\(Wu et al\.,[2026b](https://arxiv.org/html/2606.14199#bib.bib56)\)\.*\(ii\) Size scales the recipe\.*𝒪\\mathcal\{O\}sim\-8B\-Mid sits below𝒪\\mathcal\{O\}sim\-4B\-Mid at every shared token count by roughly the same gap on every Axis; the VL backbone \(𝒪\\mathcal\{O\}sim\-8B\-VL\-Mid\) tracks𝒪\\mathcal\{O\}sim\-8B\-Mid closely, indicating the multimodal initialisation does not cost behavioral fit\.*\(iii\) Adding generic chat data hurts behavior monotonically\.*The mix curves are ordered50:50\>70:30\>90:10\>100:050\{:\}50\>70\{:\}30\>90\{:\}10\>100\{:\}0on every panel: any non\-zero Step fraction increases behavioral PPL, and the cost grows with the fraction\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x12.png)Figure 15:Midtraining\-token vs\. geomean validation NLL per Axis\.*Solid lines*:𝒪\\mathcal\{O\}sim\-4B\-Mid \(100:0\),𝒪\\mathcal\{O\}sim\-8B\-Mid \(100:0\), and𝒪\\mathcal\{O\}sim\-8B\-VL\-Mid \(100:0\)\.*Faint blue lines*: 4B mix\-ratio sweep at 90:10, 70:30, 50:50 \(𝒪\\mathcal\{O\}dysSim:Step\)\. Markers are logged validation checkpoints; lines are PCHIP\-smoothed in log\-x space between them\. Aggregation is the geometric mean over all 63 held\-out evaluation datasets\.
#### Behavior vs general\-instruction tradeoff\.

The natural follow\-up is what mixing in Step data*buys*on the generic\-instruction side\. At matched checkpoint step 1000, we evaluate every midtraining recipe on Step’s own held\-out test split \(Step\-3\.5\-Flash\-SFT, a code/math/instruction corpus and the source of our\+ Stepbaseline\) and plot the resulting \(behavioral, Step\) NLL pairs in[Figure16](https://arxiv.org/html/2606.14199#A8.F16)\. Both axes are in nats, so a line segment between two recipes can be read as the*exchange rate*: how many nats of behavioral loss are paid per nat of Step loss reduced\.*The first 10% of Step is essentially free; everything after is overpriced\.*Going from pure𝒪\\mathcal\{O\}dysSimto a 90:10 mix saves0\.130\.13nats of Step loss for only0\.0130\.013nats of behavioral cost — a roughly ten\-to\-one bargain\. The next step \(90:10→70:3090\{:\}10\\to 70\{:\}30\) is about even; after that the trade reverses \(70:30→50:5070\{:\}30\\to 50\{:\}50costs twice the behavior it saves;50:50→50\{:\}50\\topure Step gives up27×27\\timesmore behavior than it recovers\)\. Pure\-Step training is dominated by 50:50 outright \(same Step loss, much lower behavioral loss\), so it is never the right choice if any behavioral fidelity matters\.*Scale beats mixing\.*𝒪\\mathcal\{O\}sim\-8B\-Mid sits below\-and\-left of every 4B point in[Figure16](https://arxiv.org/html/2606.14199#A8.F16): it is simultaneously better on both axes than any 4B mix\. We therefore adopt𝒪\\mathcal\{O\}sim\-8B\-Mid at100:0100\{:\}0as the headline midtraining recipe and recommend𝒪\\mathcal\{O\}sim\-4B\-Mid at90:1090\{:\}10as the Pareto\-frontier option when the smaller scale is required\.

![Refer to caption](https://arxiv.org/html/2606.14199v1/x13.png)Figure 16:Behavior vs general\-instruction tradeoff\. Both axes are mean validation NLL \(nats; lower is better, axes inverted so up\-and\-right is better\)\.*x*: behavioral NLL averaged over the 63\-dataset𝒪\\mathcal\{O\}dysSimval split\.*y*: NLL on the first 512 examples of StepStep\-3\.5\-Flash\-SFT; a random\-sample spot\-check moves the absolute NLL by≤0\.04\\leq 0\.04nats and preserves the cross\-model ranking\. Grey line traces the 4B mix\-ratio sweep\. Segment slopes give the marginal trade rate \(nats Step gained per nat behavioral loss\)\.𝒪\\mathcal\{O\}sim\-8B\-Mid sits below\-and\-left of every 4B point, indicating that model size strictly dominates Step mixing if compute allows\.

### H\.2Full Per\-Skill PPL/BLEU Table

[Table14](https://arxiv.org/html/2606.14199#A8.T14)reports the full version of[Table1](https://arxiv.org/html/2606.14199#S3.T1), including every Instruct \(\-I\) variant, the Qwen3\-32B\-I reference, and the\+ Stepgeneric\-midtraining baseline \(Qwen3\-4B/4B\-I trained on Step\-3\.5\-Flash\) that the main\-text table compresses out for space\.

Table 14:Full per\-skill PPL/BLEU table\(companion to[Table1](https://arxiv.org/html/2606.14199#S3.T1)\)\. Geometric\-mean PPL \(PL↓\\downarrow\) and arithmetic\-mean BLEU \(BL↑\\uparrow\) on the held\-out evaluation split of role\-swapped human turns\. Rows:\(A\)no midtraining,\(B\)general midtraining \(\+ Step\),\(C\)other behavior\-oriented LMs,\(D\)ours \([Section4](https://arxiv.org/html/2606.14199#S4)\)\. Notation: bare = base, “\-I” = Instruct, “\+ Step” = trained on Step\-3\.5\-Flash\. PPL within the Qwen3 tokenizer family only; BLEU across all rows\. Rows at the last logged val checkpoint \(step 4250 for 4B / 8B; step 950 for 0\.6B\)\. Best per \(metric, capability\) inbold\.†BLEU values still computed on the V1 val split \(pending re\-eval on V2\); affects only the I/VL variants here\. The 4B\-Mid and 8B\-Mid base rows are computed on V2 val \(matching the training data version\)\.ModelCONVSSCOGROLEEVALOverallPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrowPL↓\\downarrowBL↑\\uparrow\(A\) No\-midtraining baselinesQwen3\-0\.6B20\.750\.7025\.250\.6111\.172\.3711\.396\.3754\.983\.9617\.432\.80Qwen3\-4B14\.081\.4717\.521\.918\.075\.608\.5610\.7824\.146\.2712\.005\.18Qwen3\-4B\-I21\.391\.7529\.372\.5019\.1010\.6214\.2612\.2524\.816\.7419\.886\.78Qwen3\-8B14\.232\.8717\.341\.459\.723\.638\.1114\.9434\.174\.1712\.546\.17Qwen3\-8B\-I18\.442\.0728\.360\.857\.004\.8011\.6811\.3914\.386\.8314\.135\.31Qwen3\-32B\-I15\.942\.3825\.450\.896\.004\.6710\.4910\.5912\.628\.1512\.415\.28Llama\-3\.1\-8B\-I14\.344\.1120\.091\.934\.163\.587\.8212\.4113\.477\.8110\.046\.23\(B\) General midtraining baselineQwen3\-4B \+ Step12\.213\.9714\.351\.975\.783\.337\.7313\.617\.368\.069\.206\.49Qwen3\-4B\-I \+ Step12\.224\.1813\.632\.185\.363\.837\.4914\.626\.888\.558\.886\.99\(C\) Other behavior\-oriented language modelsUserLM\-8B11\.625\.3313\.295\.874\.026\.976\.614\.3712\.901\.118\.385\.12CoSER\-8B15\.702\.8614\.171\.973\.207\.396\.9614\.908\.8518\.128\.778\.05\(D\) Ours𝒪\\mathcal\{O\}sim\-0\.6B\-Mid11\.995\.5114\.182\.012\.6511\.756\.4615\.265\.6843\.027\.3511\.81𝒪\\mathcal\{O\}sim\-4B\-Mid8\.238\.019\.6710\.062\.0944\.624\.6521\.534\.2646\.175\.2826\.08𝒪\\mathcal\{O\}sim\-4B\-I\-Mid†9\.198\.499\.979\.712\.4237\.485\.5719\.424\.4143\.735\.9419\.93𝒪\\mathcal\{O\}sim\-8B\-Mid7\.628\.489\.0012\.442\.0144\.734\.3622\.494\.0345\.474\.9526\.72𝒪\\mathcal\{O\}dysSim\-8B\-I†7\.739\.119\.052\.452\.0018\.524\.4120\.224\.0646\.704\.9915\.93

### H\.3What Does Midtraining Change About the Model?

[Section4](https://arxiv.org/html/2606.14199#S4.SS0.SSS0.Px2)showed*that*midtraining helps; this section asks*what*concretely changes in the model’s outputs\. We triangulate with two probes \([Figure6](https://arxiv.org/html/2606.14199#S4.F6)\): an open\-coded inventory of lexical/structural features on the BLEU\-eval generations, and the HumT human\-likeness scalar of Cheng et al\.Cheng et al\. \([2025](https://arxiv.org/html/2606.14199#bib.bib8)\)\.

#### Surface features\.

Reading paired \(instruct\-baseline,𝒪\\mathcal\{O\}dysSim\) generations, we open\-coded an inventory ofStylefeatures \(response length, Markdown markup, em\-dash usage\) andAssistant\-traitfeatures \(chatbot boilerplate, identity confusion\), then scoredN=1,100N\{=\}1\{,\}100BLEU\-eval prompts \(dropping COG and EVAL where conversational style is not a meaningful reference\) for two off\-the\-shelf instruct baselines—Qwen3\-4B\-Instruct\-2507\(Qwen Team,[2025](https://arxiv.org/html/2606.14199#bib.bib38)\)and GPT\-5\.5\(OpenAI,[2025](https://arxiv.org/html/2606.14199#bib.bib35)\)—against𝒪\\mathcal\{O\}dysSimand the human gold reference\. Both instruct baselines emit verbose, Markdown\-heavy responses \(median116116/8383words;22\.8%22\.8\\%/23\.9%23\.9\\%Markdown markup;78\.2%78\.2\\%/40\.7%40\.7\\%em\-dash usage;18\.2%18\.2\\%/19\.5%19\.5\\%bullet markup\);𝒪\\mathcal\{O\}dysSimcollapses to the human register \(1818words,1\.6%1\.6\\%Markdown,0\.5%0\.5\\%em\-dash,0\.8%0\.8\\%bullets, vs\. human gold2323/2\.9%2\.9\\%/2\.1%2\.1\\%/2\.0%2\.0\\%\)\. Assistant boilerplate \(*“I’d be happy to,” “Of course\!” “As an AI”*\) drops from16\.3%16\.3\\%\(Qwen3\-Inst\) and6\.5%6\.5\\%\(GPT\-5\.5\) to3\.5%3\.5\\%for𝒪\\mathcal\{O\}dysSim, against a human\-gold rate of6\.4%6\.4\\%\.

#### Anthropomorphism scalar\.

HumT\(Cheng et al\.,[2025](https://arxiv.org/html/2606.14199#bib.bib8)\)computes per\-text human\-likeness as the log\-prob ratio of animate vs\. inanimate prefixes under a fixed GPT\-2 backbone \(higher = more human\)\. On HumT’s releasedN=200N\{=\}200test prompts,𝒪\\mathcal\{O\}dysSim’s mean is\+0\.13\+0\.13—roughly4×4\\timesQwen3\-Inst\-2507’s\+0\.03\+0\.03and3×3\\timesGPT\-5\.5’s\+0\.05\+0\.05, and well above HumT’s ownrejected\-anchor distribution at\+0\.06\+0\.06\. The SocioT companion confirms the shift is along anthropomorphism\-correlated directions:𝒪\\mathcal\{O\}dysSimis humbler \(lower status\), more socially close, and slightly warmer than either instruct baseline\.

Both probes converge: midtraining moves the model away from the verbose, Markdown\-heavy, helpful\-agent register of off\-the\-shelf instruct LMs toward the shorter, plainer, conversational register of human references\. The shift comes entirely from the SFT data mix \(no preference tuning\) and is large: an order\-of\-magnitude reduction in response length,∼\\sim10×10\\timesless structural markup, and33–4×4\\timeshigher HumT\.
OdysSim: Building Foundation Models for Human Behavior Simulation

Similar Articles

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation

BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics

COLLECTION FOR SOULS

OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation

Submit Feedback

Similar Articles

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation
BEHAVE: A Hybrid AI Framework for Real-Time Modeling of Collective Human Dynamics
OASIS: From Simulation Data Collection to Real-World Humanoid Loco-Manipulation