Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport
Summary
This paper introduces a method for LLM-based chatbots to dynamically generate in-group personas by first identifying a user's primary concern and then creating a synthetic persona that shares that concern. A human-subject study demonstrates significant improvements in perceived rapport and user engagement compared to baseline conditions.
# Dynamic In-Group Persona Generation for Enhancing Human–AI Rapport
Source: [https://arxiv.org/html/2606.18256](https://arxiv.org/html/2606.18256)
Yoonseok Oh1∗, Inseo Jung1∗, Jinkyu Kim1,2, Jungbeom Lee1, Minwoo Kang3†, Suhong Moon3†

1 Korea University, 2 Kakao Mobility, 3 University of California, Berkeley

bd9983@korea\.ac\.kr

∗ Equal contribution\. † Co\-corresponding authors\.
###### Abstract
LLM\-based chatbots are increasingly applied in interpersonal domains such as counseling and peer support, where establishing human–AI rapport is crucial yet remains challenging\. In this work, we introduce a novel approach for conditioning LLMs with in\-group personas, which \(i\) first identifies a user’s primary concern and brief personal context \(e\.g\., a computer science undergraduate worried about future career prospects\), and \(ii\) generates a synthetic in\-group persona that shares a similar primary concern while differing in background and narrative details, such as age or profession \(e\.g\., a junior researcher at an AI startup\)\. Furthermore, we conduct a human\-subject study to systematically evaluate the effectiveness of in\-group persona agents in enhancing human–AI rapport\. We compare our approach against two baseline conditions: a conventional agent without persona conditioning and an agent exhibiting minimal self\-disclosure \(e\.g\., “I’ve felt that too”\)\. Results from post\-task questionnaires assessing rapport and user experience indicate that the in\-group persona agent significantly improves perceived rapport and personal relevance compared to the baselines, and also yields a more positive user experience, most notably higher engagement\.
## 1 Introduction
Rapport, defined as “mutual attention, positivity, and coordination” \(tickle1990nature\), is crucial in counseling and peer support, where users must feel comfortable disclosing personal experiences \(nienhuis2018therapeutic; truong2019role\)\. In supportive and interpersonal human–AI interactions, such as peer\-support or counseling\-like conversations where users discuss personal concerns and may benefit from empathetic exchange, rapport can play a central role\. Without a sense of relational connection, users may disengage, perceive the agent as impersonal, or hesitate to share meaningful information \(liu2024chatgpt; pentina2023exploring\)\. In these contexts, rapport underpins not only conversational naturalness but also the effectiveness of AI systems in fostering user trust, cooperation, and sustained engagement\.
Figure 1: Dynamic Persona Generation for Rapport in Human\-AI Dialogue\. Comparison of three LLM agents responding to the same user utterance expressing a life concern\. *No Persona*, the chatbot LLM without additional persona conditioning, gives neutral, generic guidance; *No Persona \+ Self\-Disclosure* adds a brief, generic self\-disclosure that is not tailored to the user’s elicited context; finally, *In\-group Persona Agent \(Ours\)* uses a concern\-aligned persona derived from pre\-chat context to provide contextualized self\-disclosure\.

While large language model \(LLM\)\-based chatbots are increasingly deployed in domains such as mental health support, education, and customer service, establishing rapport with users remains a central challenge\. Despite the rapid deployment of AI\-based counseling and mental health support systems, such as Woebot and Wysa \(fitzpatrick2017delivering; inkster2018empathy\), and test trials in clinical contexts \(heinz2025randomized; macneill2024effectiveness\), chatbots often fail to demonstrate the contextual understanding of social cues required to form rapport comparable to that of human counselors \(sands2021managing; chan2022challenges; klein2025effects\)\. As a result, LLM\-based systems still face limitations in maintaining long\-term effectiveness: sustained success appears contingent on users perceiving the agent as trustworthy, empathetic, and engaging \(siddals2024happened; limpanopparat2024user; liu2025illusion\)\.
A key mechanism for fostering rapport is self\-disclosure, which refers to the act of sharing personal thoughts, feelings, or experiences to build trust and intimacy\. Interpersonal research shows that disclosure strengthens trust and intimacy via reciprocity \(carpenter2015social\)\. In Human\-Computer Interaction \(HCI\), chatbots that share personal or emotional information elicit greater user self\-disclosure and satisfaction \(ho2018psychological; lee2020hear\), and emotional self\-disclosure has been shown to increase reuse intention \(park2023effect\)\.
Another mechanism for building rapport is to show similarity and in\-group preference: according to similarity–attraction theory \(byrne1972attraction\), people respond more positively to partners who resemble them\. This effect can be generalized to chatbots, where perceived similarity in personality or style enhances trust and affinity \(reeves1996media; jin2023birds\)\. In\-group categorization further amplifies these effects, with users treating agents as more trustworthy and cooperative when framed as group members \(nass1997machines; eyssel2012social\)\. Yet, most chatbot studies have emphasized shallow similarity \(e\.g\., tone or trait matching\) rather than shared lived experiences or social identity, leaving the deeper role of contextual alignment underexplored \(ahn2021ai; alawi2023accepting\)\.
In this work, we propose a framework of *dynamic persona generation* that grounds LLM personas in users’ expressed concerns and contexts, forming the basis of our In\-group Persona Agent \(IPA\)\. This framework aims to improve relational outcomes in human–AI interaction through careful agent persona construction and concern\-grounded alignment\. Our approach operationalizes findings from peer support research in mental health highlighting how shared experiences between interlocutors provide unique benefits, hope, validation, and a sense of not being alone \(repper2011review\)\. We model users’ expressed concerns, that is, issues they currently care about that capture their emotional and goal\-oriented states \(randle2017societal; ho2018psychological; cortland2017solidarity\)\. These concerns are in turn used to form user\-centered experiential representations that inform the agent’s persona design\.
With user concern\-based experiences incorporated as part of the persona, our LLM agents are shown to produce responses that are both empathetic and contextually appropriate\. To evaluate this framework, we conducted interactive sessions simulating everyday conversations where participants discussed personal concerns such as career and employment issues in a supportive but non\-professional context\. These dialogues provided a balanced setting where users sought both emotional empathy and practical perspectives, allowing us to assess how concern\-aligned personas influence relational outcomes in human\-AI interaction\.
##### Contributions\.
We present the following contributions:
- We propose the In\-group Persona Agent \(IPA\), a method that dynamically integrates persona generation and conversational inference through a multi\-stage prompt pipeline, constructing concern\-aligned personas from elicited user dialogue context\.
- We complement the human\-subject evaluation with post\-hoc analyses that validate persona quality and characterize turn\-level and conversation\-level interaction patterns, thereby shedding light on the behavioral mechanisms underlying rapport gains\.
- Controlled evaluations show that IPA improves rapport \(especially personal relevance\) over baselines, with modest UX gains most evident in engagement, highlighting in\-group alignment as a key factor in human–AI relational quality\.
Figure 2: Overview of the proposed In\-Group Persona Agent\. \(a\) Overall interaction flow: a pre\-chat dialogue elicitation with a sufficiency check collects the user’s concern; the user then selects one of five candidate personas; the main conversation proceeds with the selected persona \(see [Section 3](https://arxiv.org/html/2606.18256#S3)\)\. \(b\) Persona generation process: traits are extracted from the pre\-chat dialogue and partitioned into *collected* vs\. *not collected*; using both kinds of information, the system composes a background and a narrative \(similar experience and resolution\) and combines the traits in different ways to produce five distinct personas\.
## 2 Related Work
### 2\.1 LLM Persona Conditioning
Large Language Models \(LLMs\) can be steered via prompt\-only conditioning to display persona\-like linguistic patterns without parameter updates\. Evidence to date is strongest within\-session and is task\-dependent, indicating reliable short\-horizon style control rather than persistent identity \(moon2024virtual; kang2025deep; jiang2023personallm; tseng2024two\)\. Backstory\-style prompts improve response consistency and distributional alignment in large\-scale simulations \(moon2024virtual; kang2025deep\), while personality prompts \(e\.g\., Big Five\) induce trait\-congruent tendencies across tasks under matched instructions \(jiang2023personallm\)\. Recent evaluation work further studies LLM agents conditioned on assigned personas, assessing whether their actions, linguistic habits, and responses remain consistent with the given persona across diverse environments \(Samuel et al\., [2024](https://arxiv.org/html/2606.18256#bib.bib1)\)\.
Unlike previous studies that relied on static backstories or trait\-based prompts to elicit transient stylistic alignment, our In\-group Persona Agent dynamically generates concern\-aligned personas grounded in users’ expressed contexts, enabling sustained relational alignment and measurable improvements in rapport and engagement\.
### 2\.2 Self\-Disclosure by AI Chatbots
Research on interpersonal dynamics and the Computers Are Social Actors \(CASA\) framework indicates that disclosure\-based reciprocity can arise even when people interact with artificial agents \(collins1994self; reeves1996media\)\. Experimental work also shows reciprocal responding toward computer\-based agents, specifically increased user self\-disclosure following an agent’s disclosure \(moon2000intimate\)\. In studies on conversational AI systems, agent self\-disclosure increases users’ self\-disclosure \(lee2020hear\) and improves satisfaction \(and reuse intentions\) relative to non\-disclosing baselines \(park2023effect\)\. However, reported benefits are conditional on the contextual relevance and perceived appropriateness of the disclosure \(lee2020hear; park2023effect; tsumura2023influence\)\. Notably, prior manipulations of disclosure are typically generic \(e\.g\., emotional vs\. neutral/factual statements\) rather than tailored to the user’s present concern \(lee2017enhancing; ho2018psychological; park2023effect\)\.
We address this conditionality by operationalizing relevance via concern\-aligned persona cues and by comparing no\-persona and generic self\-disclosure baselines to isolate the added value of contextual alignment\.
### 2\.3 Similarity & In\-Group Cues
Similarity manipulations affect user judgments and intentions, and their effects vary across application domains\. For example, in voice interfaces, matching voice personality increases perceived social presence \(lee2003designing\)\. In recommender chatbots, personality alignment improves evaluation and intention measures \(jin2023birds\)\. In social robotics, in\-group framing yields more favorable judgments \(e\.g\., liking, anthropomorphism\) and approach intentions \(nass1997machines; eyssel2012social\)\. These operationalizations predominantly rely on static trait/style matching or categorical group labels rather than situational alignment to a user’s ongoing concern\.
Taken together, these observations motivate a focus on*concern\-aligned*relational cues rather than reliance on generic disclosure or static trait/style matching\.
## 3 Method
### 3\.1 Overview
The Persona Agent analyzes the dialogue preceding the main conversation to infer the user’s concerns and related information\. Based on the inferred information, it generates a persona that belongs to the user’s in\-group and shares the same concerns as the user, embedding this fictional persona \(comprising a character profile and curated past experiences\) before the dialogue begins\. This embedded persona enables the agent to deliver empathetic utterances and self\-disclosing narratives, thereby achieving a higher level of rapport compared to chatbots without an explicit, tailored persona\. For instance, when responding to a user concerned about work\-related stress, a generic chatbot might respond, “I’m sorry you’re feeling stressed; consider taking breaks during work\.” In contrast, a persona\-embedded agent might share, “I remember feeling overwhelmed at my own job a few years ago\. I found that taking short outdoor walks during my breaks helped me manage my stress better\.” The latter response leverages persona\-driven self\-disclosure, which fosters stronger rapport\. Such an approach is particularly valuable in domains where user experience is paramount, including psychological support, education, and personalized services\.
### 3\.2 Persona Definition
A persona in this study consists of a synthetic agent background and past narratives\. By injecting this information into the prompt, we create a persona agent, a virtual persona capable of interacting with the user\.
##### Background\.
Background provides a concise overview of who the persona is, including aspects such as personal history, education, or interests\. While it does not directly resolve the user’s concern, it frames the agent as a coherent social actor with recognizable traits and contextual grounding\.
##### Narrative\.
The narrative recounts how the synthetic persona previously faced a challenge similar to the user’s concern and describes the steps they took to overcome it\. Based on this synthetic narrative, models engage in strategic self\-disclosure that conveys empathy and shared experience\.
### 3\.3 Pre\-chat Dialogue Elicitation
During pre\-chat dialogue, a concern\-elicitation agent \(collector\) engages the user to elicit the primary concern and brief context\. After every two turns, the system checks whether the information is sufficient to proceed; if not, the agent continues with concise clarification questions\. Once sufficiency is met, the pipeline advances to persona generation\. The fixed sufficiency rule and prompt templates are provided in [Appendix F](https://arxiv.org/html/2606.18256#A6)\.
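The elicitation loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the callables `ask`, `get_user_reply`, and `is_sufficient`, and the `max_turns` safety cap, are hypothetical placeholders, and the actual sufficiency rule lives in Appendix F, which is not reproduced here. One turn is taken to be a single utterance, so two turns correspond to one collector question plus one user reply.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance)


def run_prechat(
    ask: Callable[[List[Turn]], str],
    get_user_reply: Callable[[str], str],
    is_sufficient: Callable[[List[Turn]], bool],
    check_every: int = 2,
    max_turns: int = 20,  # hypothetical safety cap, not from the paper
) -> List[Turn]:
    """Collect pre-chat turns, checking sufficiency every `check_every` turns."""
    history: List[Turn] = []
    while len(history) < max_turns:
        question = ask(history)                      # collector turn
        history.append(("collector", question))
        history.append(("user", get_user_reply(question)))  # user turn
        # Each pass adds two turns, so with check_every=2 the check runs
        # after every collector/user exchange.
        if len(history) % check_every == 0 and is_sufficient(history):
            break
    return history


def demo_sufficient(history: List[Turn]) -> bool:
    """Placeholder rule (assumption): stop once the user has replied twice."""
    return sum(1 for speaker, _ in history if speaker == "user") >= 2


if __name__ == "__main__":
    turns = run_prechat(
        ask=lambda h: "Could you tell me a bit more about that?",
        get_user_reply=lambda q: "I'm worried about finding a job.",
        is_sufficient=demo_sufficient,
    )
    print(len(turns), "turns collected")
```

In a real system, `ask` and `is_sufficient` would each wrap an LLM call; they are kept as injected callables here so that the control flow can be tested without a model.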
### 3\.4 Persona Generation
Based on information collected during the pre\-chat dialogue phase, the system generates five in\-group personas that reflect the user’s context and characteristics, so that all generated personas feel relevant and relatable to the user\.
To achieve this, the system categorizes user traits into collected information \(e\.g\., concerns, explicitly stated preferences, or demographic indicators\) and information unavailable from the pre\-chat dialogue \(i\.e\., traits that remained unspecified\)\. Based on this categorization, our workflow combines the collected traits in different ways to generate five distinct personas\. Each synthetic persona shares the same primary concern as the user but differs in details of the assigned background and narrative, such as age or profession, to provide diverse perspectives on the shared concern\. Traits classified as not collected are left unspecified, so that the resulting profiles reflect only information the user has actually disclosed\.
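One possible reading of the trait partitioning and combination step is sketched below. The paper does not specify how traits are combined, so the subset-enumeration strategy, the trait names, and the `primary_concern` key are all assumptions made for illustration. In the actual pipeline each specification would be handed to an LLM prompt that writes the background and narrative.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple

CONCERN_KEY = "primary_concern"  # hypothetical trait name


def partition_traits(
    traits: Dict[str, Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split traits into collected (has a value) and not collected (None)."""
    collected = {k: v for k, v in traits.items() if v is not None}
    not_collected = [k for k, v in traits.items() if v is None]
    return collected, not_collected


def persona_specs(
    collected: Dict[str, str],
    not_collected: List[str],
    n_personas: int = 5,
) -> List[dict]:
    """Build persona specs that all share the concern but mirror different
    subsets of the other collected traits (an assumed combination scheme)."""
    concern = collected[CONCERN_KEY]
    others = sorted(k for k in collected if k != CONCERN_KEY)
    # Enumerate subsets from largest to smallest so early specs stay closest
    # to the user.
    subsets = [c for r in range(len(others), -1, -1)
               for c in combinations(others, r)]
    return [
        {
            CONCERN_KEY: concern,  # always shared with the user
            "shared_traits": {k: collected[k] for k in shared},
            "unspecified": list(not_collected),  # never invented
        }
        for shared in subsets[:n_personas]
    ]


if __name__ == "__main__":
    traits = {
        "primary_concern": "future career prospects",
        "field": "computer science",
        "education": "undergraduate",
        "interest": "AI research",
        "age": None,
        "profession": None,
    }
    collected, missing = partition_traits(traits)
    for spec in persona_specs(collected, missing):
        print(sorted(spec["shared_traits"]))
```

With three collected traits besides the concern, there are eight possible subsets, so five distinct specifications are always available; with fewer traits the function returns fewer than five, and a real system would need another source of variation.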
Example of Generated Persona \(Abridged\)

Background: I’m a 26\-year\-old computer science graduate who developed a fascination with AI research during my undergraduate studies, particularly after taking courses in machine learning and neural networks…

Narrative: After graduating, I was eager to dive into AI research but quickly discovered that breaking into the field was more challenging than I anticipated\. Most research positions required either graduate degrees or significant practical experience that I lacked…

Details of this example can be found in [Appendix F](https://arxiv.org/html/2606.18256#A6) \(LABEL:lst:ex\_prechat and LABEL:lst:ex\_persona\)\.
### 3\.5 Persona Quality Validation
We evaluate the validity of the dynamic persona generation framework through two complementary methods\.
#### 3\.5\.1 Persona Evaluation Rubric
Persona quality is assessed using LLM\-based evaluation on two dimensions: In\-group Fitness and Concern Resolution Quality\. In\-group Fitness comprises two sub\-dimensions, Shared Background / Identity \(IF1\) and Shared Skills / Interests \(IF2\)\. Concern Resolution Quality comprises Concern Match \(CR1\) and Narrative Authenticity \(CR2\)\. Each sub\-dimension is scored on a 0–4 scale, yielding up to 8 points per dimension and a total of 16 points\.
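The scoring arithmetic of the rubric can be made concrete with a small container type. This sketch covers only the aggregation of sub-dimension scores; the class and field names are illustrative, and the LLM-based judgment that produces each 0–4 score is outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScores:
    """Four sub-dimension scores, each an integer in 0..4."""
    if1: int  # Shared Background / Identity
    if2: int  # Shared Skills / Interests
    cr1: int  # Concern Match
    cr2: int  # Narrative Authenticity

    def __post_init__(self) -> None:
        for name in ("if1", "if2", "cr1", "cr2"):
            value = getattr(self, name)
            if not 0 <= value <= 4:
                raise ValueError(f"{name} must be in 0..4, got {value}")

    @property
    def in_group_fitness(self) -> int:
        """In-group Fitness dimension, 0..8."""
        return self.if1 + self.if2

    @property
    def concern_resolution(self) -> int:
        """Concern Resolution Quality dimension, 0..8."""
        return self.cr1 + self.cr2

    @property
    def total(self) -> int:
        """Overall persona quality, 0..16."""
        return self.in_group_fitness + self.concern_resolution


if __name__ == "__main__":
    scores = RubricScores(if1=4, if2=3, cr1=4, cr2=4)
    print(scores.in_group_fitness, scores.concern_resolution, scores.total)
```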
#### 3\.5\.2 Pre\-chat Dialogue Information Sufficiency Check
To examine whether the pre\-chat dialogue provides adequate information for persona generation, we perform a sufficiency check\. For a dialogue consisting of $n$ turns, we segment it into cumulative intervals of $2, 4, \ldots, n$ turns\. Personas are generated from each interval, and their quality scores are compared\. For example, if $n = 6$, personas are generated from turns \(1–2\), \(1–4\), and \(1–6\)\.
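The cumulative segmentation is a simple prefix operation. The following sketch assumes the dialogue is held as a Python list of turns; the function name is illustrative.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def cumulative_prefixes(turns: Sequence[T], step: int = 2) -> List[Sequence[T]]:
    """Return the prefixes covering the first 2, 4, ..., n turns.

    If the number of turns is not a multiple of `step`, the trailing
    remainder is not covered by any prefix.
    """
    return [turns[:k] for k in range(step, len(turns) + 1, step)]


if __name__ == "__main__":
    dialogue = ["t1", "t2", "t3", "t4", "t5", "t6"]
    for prefix in cumulative_prefixes(dialogue):
        # A persona would be generated and scored from each prefix.
        print(f"turns 1-{len(prefix)}")
```

For a six-turn dialogue this yields three prefixes, matching the (1–2), (1–4), (1–6) example in the text.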
## 4 Experiments
### 4\.1 Topic of Conversation
We set the topic of conversation between the user and the agent to career and employment\. This domain was chosen because it naturally contains a balanced mix of concerns that require emotional support \(e\.g\., coping with workplace stress\) and concerns that require problem\-solving or informational guidance \(e\.g\., job search strategies, career planning\)\. This balance makes it a suitable testbed for examining how users interact with a persona\-based conversational agent across both affective and cognitive dimensions\.
### 4\.2 Procedure
Each session consisted of three stages conducted sequentially \(cf\. the overall architecture in [Figure 2](https://arxiv.org/html/2606.18256#S1.F2)\)\. All processes, including inference, persona generation, and conversation with the user, were implemented using GPT\-4o \(hurst2024gpt\)\.
1. Pre\-chat\. A concern\-elicitation agent \(collector\) first engaged the user in a brief conversation to gather the user’s primary concern or topic of discussion\. After every two turns of conversation, the system evaluated whether the user’s concern and related contextual information were sufficiently clear\. If the information was insufficient, the agent continued to ask concise follow\-up questions until enough details were obtained\. Once the concern and its related contextual information were sufficiently collected, the session proceeded to the Profile Selection stage\.
2. Profile Selection\. The system generated five persona profiles aligned with the user’s stated concern\. Users selected one profile after reviewing textual background information \(skipped if No Persona\)\.
3. Post\-chat\. The user then engaged in an open\-ended chat with the chatbot endowed with the selected persona\. After the dialogue, the user completed an online questionnaire\.
Throughout the interaction, system and user messages were logged for analysis\. The conversation proceeded in a round\-robin, turn\-based manner, where each turn consisted of a single utterance from either the user or the agent\.
Figure 3: Group comparison of Rapport and UX items\. Bars display mean ± 95% CI for visualization, although all statistical tests were performed on medians due to non\-normal distributions \(Shapiro–Wilk; see [Appendix B](https://arxiv.org/html/2606.18256#A2)\)\. Rank\-based one\-sided Mann–Whitney contrasts tested the pair\-wise ordered hypothesis NoP < NoPs < IPA\. Bar colors denote the three conditions: NoP, NoPs, and IPA\.
### 4\.3 Participants and Conditions
We conducted a randomized controlled trial \(RCT\) to evaluate interactions with assigned persona\-conditioned LLM agents, after obtaining Institutional Review Board \(IRB\) approval\. All participants provided informed consent prior to participation, and their privacy and anonymity were ensured throughout the study\. We recruited 210 participants via CloudResearch’s Connect platform \(hartman2023introducing\) and randomly assigned each to one of three experimental conditions \($n = 70$ per condition\):
*Ours:* In\-group Persona Agent \(IPA\)\. Participants conversed with an agent equipped with a user\-tailored persona\. The system first generated five candidate personas from pre\-chat information, and participants selected one before the main dialogue\.
Baseline: No Persona Agent \(NoP\)\. Participants interacted with an agent that retained dialogue history but had no defined persona\. This served as a baseline condition without persona\-based framing or self\-disclosure\.
Baseline: NoP with Self\-Disclosure \(NoPs\)\. Participants conversed with an agent that lacked a persona but could use brief, generic self\-disclosing statements \(e\.g\., preferences or feelings\) not tailored to the participant’s situation\.
Immediately following the interaction, participants completed a post\-task questionnaire\. One attention\-check item instructed participants to choose a specific response\. Of the 210 recruited participants, we excluded those who failed the attention check and sessions whose dialogues were off\-topic relative to the elicited concern\. The final analysis sample comprised 170 participants \(NoP: $n = 54$, NoPs: $n = 59$, IPA: $n = 57$\)\.
### 4\.4 Measures
Immediately following the interaction, participants completed a post\-task questionnaire assessing their perceptions of the agent and the overall dialogue experience\. We adopted items from baihaqi2024rapport, combining measures of rapport \(six items, R1–R7 except R6; see footnote 1\) and user experience \(seven items\)\. All items were rated on a 7\-point Likert scale, with higher scores indicating greater rapport or more positive user experience\. The items captured key aspects such as perceived emotional connection, enjoyment, comfort, personal relevance, mutual interest \(e\.g\., “I feel a bond between this virtual agent and myself”, R5\), conversation naturalness and flow, satisfaction, interest, engagement, comprehensibility, and willingness to use the agent again\. Details on the questionnaires for rapport and UX are provided in [Appendix G](https://arxiv.org/html/2606.18256#A7)\.

Footnote 1: R6 \(“I really care about this virtual agent”\) was excluded from the rapport composite due to lexical ambiguity and lack of explicit interaction\-referential wording, which reduces sensitivity to single\-session effects\.
### 4\.5 Statistical Analysis
We analyzed outcomes using rank\-based nonparametric methods, given the ordinal response scales and distributional diagnostics\. Our primary test evaluated the pre\-registered ordered alternative NoP ≤ NoPs ≤ IPA via a one\-sided Jonckheere–Terpstra trend test \(terpstra1952asymptotic; jonckheere1954distribution\)\. To characterize condition differences within each construct family \(Rapport or UX\), we conducted planned one\-sided Mann–Whitney $U$ contrasts for all condition pairs in the hypothesized direction \(mann1947test\), controlling family\-wise error within each family using Holm’s procedure \(holm1979simple\)\. We report effect sizes and descriptive differences alongside adjusted $p$\-values; analyses used the preprocessed sample \(NoP, $n = 54$; NoPs, $n = 59$; IPA, $n = 57$\)\. Further implementation details, assumption checks, and full statistical outputs are provided in [Appendix B](https://arxiv.org/html/2606.18256#A2)\.
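To make the analysis pipeline concrete, the sketch below implements the Jonckheere–Terpstra statistic as a sum of pairwise Mann–Whitney counts, a Monte Carlo permutation p-value for the increasing-trend alternative, and Holm's step-down adjustment. It is a standard-library illustration under stated assumptions, not the authors' code: the paper does not say whether its p-values were asymptotic or permutation-based, and the permutation approach is substituted here because it handles the heavy ties of Likert data without a variance correction. The toy ratings are invented for demonstration only.

```python
import random
from typing import List, Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """Number of pairs with x_i < y_j, counting ties as 0.5."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in x for b in y)


def jonckheere_terpstra(groups: Sequence[Sequence[float]]) -> float:
    """JT statistic: pairwise counts summed over groups in hypothesized order."""
    return sum(
        mann_whitney_u(groups[i], groups[j])
        for i in range(len(groups))
        for j in range(i + 1, len(groups))
    )


def jt_permutation_p(
    groups: Sequence[Sequence[float]], n_perm: int = 2000, seed: int = 0
) -> float:
    """One-sided Monte Carlo p-value for an increasing trend."""
    rng = random.Random(seed)
    observed = jonckheere_terpstra(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign observations to groups at random
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if jonckheere_terpstra(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Invented 7-point ratings for one item, ordered NoP, NoPs, IPA.
    nop, nops, ipa = [3, 4, 4, 5, 2], [4, 4, 5, 3, 5], [5, 6, 6, 7, 5]
    print("JT statistic:", jonckheere_terpstra([nop, nops, ipa]))
    print("trend p-value:", jt_permutation_p([nop, nops, ipa]))
    print("Holm-adjusted:", holm_adjust([0.01, 0.04, 0.03]))
```

In practice the pairwise contrast p-values passed to `holm_adjust` would come from a one-sided Mann–Whitney test, applied separately within the Rapport and UX families as the text describes.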
## 5 Results
### 5\.1 Main Result
##### Overview\.
Group\-level comparisons are visualized in [Figure 3](https://arxiv.org/html/2606.18256#S4.F3) and summarized in [Table 3](https://arxiv.org/html/2606.18256#A1.T3)\. For visualization, data are shown as mean ± 95% CI to convey central tendency and variability, but all inferential tests were performed on median ranks using nonparametric procedures due to significant deviations from normality \(see [Appendix B](https://arxiv.org/html/2606.18256#A2)\)\.
##### Rapport\.
Across all rapport items \(R1–R5, R7\), scores increased monotonically from NoP to NoPs to IPA, with the largest and most consistent gains for IPA over NoP\. These improvements were most pronounced on items indexing perceived relatedness and mutual engagement \(especially R3 “relevant to me,” R5 “bond with the agent,” and R7 “personal interest,” as well as R1 “think about relationship”\)\. In contrast, NoPs provided only modest benefits over NoP, suggesting that generic self\-disclosure alone yields limited relational improvement\. Finally, IPA also exceeded NoPs across all rapport items, indicating that persona alignment adds relational depth beyond disclosure alone\.
##### User Experience\.
Across user\-experience items, IPA received the highest scores overall, whereas NoP and NoPs were generally close and did not follow a consistent ordering\. Relative to NoP, the clearest improvements for IPA centered on E4 \(engagement\) and I3 \(perceived relevance\), alongside more favorable satisfaction and willingness to continue\. The NoPs–NoP comparison was mixed, with small gains on some items but slight declines on others, indicating that generic self\-disclosure did not reliably improve overall experience\. Finally, IPA exceeded NoPs as well, with the most salient gains again tied to relevance, engagement, and continued\-use intent, indicating additional benefits from persona alignment beyond disclosure alone\.
##### Summary\.
Taken together, the analyses reveal a graded pattern of improvement: \(1\) NoP → NoPs: generic self\-disclosure yields limited and mixed benefits; \(2\) NoPs → IPA: persona alignment adds clear gains, especially on rapport dimensions tied to perceived relevance and personal connection; and \(3\) NoP → IPA: the cumulative difference is most pronounced on key rapport and relevance\-related outcomes\. These results show that incorporating concern\-aligned in\-group persona cues consistently strengthens rapport relative to the baselines, with more modest user\-experience gains that are most evident in engagement\.
### 5\.2 Validation Results
Following the validation procedures described in [Section 3\.5](https://arxiv.org/html/2606.18256#S3.SS5), we conducted two sets of experiments, testing \(1\) whether the proposed rubric effectively captures in\-group alignment and concern reflection in generated personas; and \(2\) whether the amount of information collected in the pre\-chat dialogue is sufficient for high\-quality persona generation\.
#### 5\.2\.1 Validation of Persona Quality Rubric
To examine whether the rubric for persona quality evaluation appropriately captures whether a persona generated from the pre\-chat dialogue belongs to the user’s in\-group and reflects similar concerns, we compared scores between two groups: one in which the persona was evaluated against the pre\-chat dialogue used for its generation \(Matched\), and one in which the persona was evaluated against a different pre\-chat dialogue \(Not Matched\)\. [Figure 4](https://arxiv.org/html/2606.18256#A3.F4) \([Appendix C](https://arxiv.org/html/2606.18256#A3)\) shows that Matched consistently received higher scores than Not Matched across all rubric items\. This indicates that evaluation with the rubric can effectively verify whether a persona is appropriate for the user\.
#### 5\.2\.2 Validation of Pre\-chat Dialogue Sufficiency Check
In the pre\-chat dialogue stage, we verify whether the information collected every two turns is sufficient for persona generation\. To examine this, a subset of 70 pre\-chat dialogues from the full set collected in the experiment was segmented into cumulative intervals of two turns, and a persona was generated at each interval\. In total, 480 personas were generated and subsequently evaluated for quality\. Overall, the results demonstrate that the rubric\-based evaluation is sensitive to information sufficiency, assigning lower scores to personas derived from incomplete contexts and higher scores to those generated from sufficient dialogue history\. Detailed results are provided in [Appendix D](https://arxiv.org/html/2606.18256#A4)\.
## 6 Analysis
### 6\.1 Correlation Between Persona Quality and Rapport Score
Following the validation of the persona quality rubric in the previous analysis, we conducted correlation analyses to examine whether rubric scores were associated with self\-reported outcomes \(rapport and UX; see [Tables 6](https://arxiv.org/html/2606.18256#A2.T6) and [7](https://arxiv.org/html/2606.18256#A2.T7) for detailed results\)\. The selected personas generally demonstrated high rubric scores, with Narrative Authenticity showing zero variance and thus being excluded from correlation interpretation\.
Pearson correlation analysis revealed that overall persona quality \(Total\) exhibited significant positive correlations with R5 \(sense of connection, $r = 0.36$, $p < .05$\) among rapport items and notably with E5 \(intention to continue conversation in the future, $r = 0.37$, $p < .05$\) among UX items\. At the dimensional level, Shared Skills showed the strongest correlation with E5 \($r = 0.48$, $p < .05$\), linking shared interests to continued use\. In contrast, no consistent significant correlations emerged across other rapport items, indicating that persona quality is selectively associated with specific outcomes such as “sense of connection” and “continuation intention” rather than comprehensively explaining rapport as a whole\.
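The exclusion of Narrative Authenticity follows directly from the definition of the Pearson coefficient, whose denominator is zero when one variable is constant. The sketch below is a plain standard-library implementation that returns `None` in that case; the function name and the sample values are illustrative and are not taken from the study data.

```python
import math
from typing import Optional, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either variable has zero variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rubric_total = [14, 15, 16, 13, 16]          # invented rubric totals
    narrative_authenticity = [4, 4, 4, 4, 4]     # constant, as in a ceiling effect
    r5_bond = [4, 5, 6, 4, 7]                    # invented questionnaire ratings
    print(pearson_r(rubric_total, r5_bond))
    print(pearson_r(narrative_authenticity, r5_bond))  # None
```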
### 6\.2 Turn\-level Behavioral Complement
While self\-reported rapport captures the overall experience, it does not reveal *specific* behavioral nuances across conditions\. To address this, we conducted a turn\-level analysis focusing on *immediate partner responses* within the dialogue logs\. Specifically, we scored each utterance using an LLM rubric \(0–3\) for self\-disclosure \($sd_t$\) and empathy \($emp_t$\), and subsequently classified them as “high” or “low” based on fixed thresholds \(details in [Sections E\.1](https://arxiv.org/html/2606.18256#A5.SS1) and [E\.2](https://arxiv.org/html/2606.18256#A5.SS2)\)\. We then calculated the conditional probability of a response in speaker\-change adjacent pairs, conditioned on the partner’s preceding behavior\. We report these probabilities along with their difference, defined as the reciprocity index $R$ \([Table 1](https://arxiv.org/html/2606.18256#S6.T1)\)\.

| Condition | $p^{A \to U}_{sd \mid high}$ | $p^{A \to U}_{sd \mid low}$ | $R^{A \to U}_{SD}$ | $p^{U \to A}_{emp \mid high}$ | $p^{U \to A}_{emp \mid low}$ | $R^{U \to A}_{EMP}$ |
| --- | --- | --- | --- | --- | --- | --- |
| NoP | NaN | 0\.377 | NaN | 0\.908 | 0\.631 | \+0\.277 |
| NoPs | 0\.425 | 0\.454 | \-0\.029 | 0\.993 | 0\.936 | \+0\.057 |
| IPA | 0\.620 | 0\.249 | \+0\.371 | 0\.903 | 0\.539 | \+0\.364 |

Table 1: Directional reciprocity in the post segment\. Each $p$ denotes a conditional probability of a next\-turn response event \(User deep self\-disclosure or Agent empathy\) given whether the immediately preceding turn contains deep self\-disclosure \(event present vs\. absent; “high/low”\)\. $R$ is the difference of the two conditional probabilities \(e\.g\., $R^{A \to U}_{SD} = p^{A \to U}_{sd \mid high} - p^{A \to U}_{sd \mid low}$\)\. NaN indicates non\-estimability due to event sparsity\.

We analyze the post segment to capture interactions after persona manipulation was active \(see [Section E\.3](https://arxiv.org/html/2606.18256#A5.SS3.SSS0.Px2)\)\. IPA shows the clearest User self\-disclosure reciprocity in response to the agent \($A \to U$\): deep User self\-disclosure was more likely after deep agent self\-disclosure than after non\-deep agent turns \($0.620$ vs\. $0.249$; $R^{A \to U}_{SD} = +0.371$\)\. Agent empathy following User self\-disclosure \($U \to A$\) was positive across conditions, with the largest increase in IPA \($R^{U \to A}_{EMP} = +0.364$\), while NoPs showed a smaller increase partly due to its near\-ceiling baseline empathy\.
Overall, these results suggest that IPA uniquely strengthens self\-disclosure reciprocity, while the agent’s empathic responsiveness to User self\-disclosure is broadly present across conditions\.
## 7 Conclusion
This work demonstrates that embedding in\-group personas into conversational AI can meaningfully enhance rapport and user experience\. By combining pre\-chat dialogue elicitation with structured persona framing and self\-disclosure, our approach enables agents to appear more relatable, empathetic, and engaging\. After pre\-processing, we analyzed data from 170 participants and found that persona\-embedded agents consistently outperformed both non\-persona and self\-disclosing baselines on rapport measures, including perceived personal relevance\. Overall, lightweight, prompt\-based persona design offers a practical means to strengthen rapport and engagement in supportive interpersonal settings\. This conclusion is supported by analyses linking persona quality to users’ sense of connection and continuation intention, and by turn\-level evidence of increased self\-disclosure reciprocity\. Extending this framework to more sensitive domains remains an important direction for future work\.
## 8 Limitations
Although our study offers insights into the role of in\-group personas in conversational AI, it also carries several limitations that should be acknowledged\.
Anthropomorphism: While persona\-driven design can enhance users’ sense of connection, excessive anthropomorphic cues may also introduce unintended side effects\. Overly human\-like behaviors, such as emotional overexpression or self\-referential remarks, can create unrealistic expectations about the agent’s capabilities or authenticity\. This mismatch between user perception and system intent may lead to confusion or reduced trust, suggesting that careful calibration of anthropomorphic elements is essential for maintaining credible and comfortable human–AI interaction\.
Pre\-chat Sufficiency: Our analyses focus on sessions that met a minimum level of pre\-chat sufficiency, enabling personas to be grounded in user\-provided cues\. Because we did not systematically collect a comparable set of “insufficient” pre\-chat cases, we cannot conduct a reliable ablation on how pre\-chat quantity affects persona quality and downstream outcomes\. This limits our understanding of robustness under sparse inputs, suggesting that future work should evaluate broader ranges of pre\-chat completeness\.
Disclosure\-Averse Users: Our framework assumes that users are comfortable articulating their primary concern during the pre\-chat stage, and our experiments were designed around participants for whom such disclosure is a relatively low burden\. In practice, some users may be hesitant to share personal details upfront, potentially reducing usability and limiting persona quality\. While the pre\-chat agent is intended to ease disclosure through guided prompting, disclosure\-averse populations remain out of scope for the current study, motivating future work in this direction\.
Persona Selection Effect: In the IPA condition, participants selected one persona from five generated candidates\. We adopted this design because the brief pre\-chat phase may not provide enough information to reliably identify a single best\-matching persona automatically, and because allowing users to review and choose a disclosed persona can help them select the perspective they personally need in a peer\-support\-like setting\. However, this choice may also have increased perceived relevance, rapport, or satisfaction, potentially overstating the effect of the in\-group persona itself\.
Future work should address these limitations by expanding the scope of participants and analyses in order to provide more generalizable insights\.
## 9 Ethical Considerations
##### Potential Harms of Synthetic In\-group Personas\.
We acknowledge that in sensitive domains such as mental health, trauma recovery, or medical advice, a synthetic persona that feigns shared human experience can lead to severe consequences\. In such contexts, “synthetic empathy” may induce undue trust, discourage seeking professional help, or otherwise exacerbate vulnerability\. Our study, however, was conducted in the comparatively lower\-risk context of career and employment counseling, rather than in high\-stakes settings such as suicide prevention or grief counseling\. The persona’s “experience” was limited to professional challenges and workload management, and the interactions were framed as instrumental support rather than therapeutic intervention\. Accordingly, we expect the immediate risk of harm to be substantially lower than in clinical settings, while noting that similar risks could arise if such personas were deployed in more sensitive contexts\.
##### Participant Disclosure and Right to Withdraw\.
Prior to participation, we informed participants in the study introduction about potential risks and disadvantages, including the possibility of discomfort during experimental procedures\. Participants were explicitly informed that they could withdraw at any time without penalty\. We also indicated that appropriate support would be available if any discomfort persisted\. We provide the disclosure text in the appendix, with selected portions redacted to preserve the authors’ anonymity\.
##### Scope and Limitations\.
These ethical considerations apply to the specific experimental scope described above\. We caution against generalizing the safety of synthetic in\-group personas to high\-stakes or clinical domains without domain\-specific safeguards, oversight, and evaluation\.
## Appendix
## Appendix A Experiment Details
This section provides details of our experiment\. [Table 3](https://arxiv.org/html/2606.18256#A1.T3) shows the overall results of the human evaluation\.
### A\.1 Demographic Distributions of Participants
| Category | n (%) |
|---|---|
| **Gender** | |
| Man | 86 (50.6%) |
| Woman | 78 (45.9%) |
| Missing | 3 (1.8%) |
| Prefer not to say | 2 (1.2%) |
| Agender | 1 (0.6%) |
| **Race** | |
| White | 127 (74.7%) |
| Black or African American | 14 (8.2%) |
| An ethnicity not listed here | 7 (4.1%) |
| Chinese | 6 (3.5%) |
| Vietnamese | 4 (2.4%) |
| Missing | 3 (1.8%) |
| Filipino | 3 (1.8%) |
| Asian Indian | 2 (1.2%) |
| Japanese | 1 (0.6%) |
| Other | 1 (0.6%) |
| Korean | 1 (0.6%) |
| American Indian or Alaska Native | 1 (0.6%) |
| **Employment Status** | |
| Full-time | 80 (47.1%) |
| Part-time | 27 (15.9%) |
| Unemployed | 18 (10.6%) |
| Not in paid work | 14 (8.2%) |
| Business Owner | 10 (5.9%) |
| Student | 9 (5.3%) |
| Retired | 6 (3.5%) |
| Prefer not to say | 3 (1.8%) |
| Missing | 3 (1.8%) |

Table 2: Demographic distributions \(n, %\)\. 170 participants in total; demographic information is missing for 3\.

[Table 2](https://arxiv.org/html/2606.18256#A1.T2) shows the demographic distributions of the participants\. Overall, the sample exhibits reasonable variation in background \(e\.g\., gender and employment situations\), and participants engaged in conversations about their own real career and employment concerns rather than role\-played or hypothetical scenarios\. At the same time, the pool is geographically and linguistically constrained \(U\.S\.\-based, English\-speaking\) and was recruited from a single crowdsourcing platform, which may limit broader generalizability\.
### A\.2 Participant Disclosure: Risks, Withdrawal, and Support
Participants were informed of this procedure in advance as part of the study instructions and consent process\. They were also informed of potential risks or discomforts, data handling and privacy protections, and that participation was voluntary and could be withdrawn at any time without penalty\.
Disadvantages, Risks of Participation, and Compensation for Harm: Since the study involves discussing personal concerns with a chatbot, psychological discomfort or risks may arise\. You are free to stop the experiment at any time without notifying the researcher\. If you experience psychological discomfort after completion or withdrawal, you may contact the principal investigator listed below, who will consult with the university \(Korea University\) to arrange appropriate support\.
### A\.3 Differences Between IPA and Baselines
All three conditions used the same user messages as conversational context, ensuring that the baseline personalization naturally provided by LLMs was held constant across conditions\. The only difference was the type of additional personalization provided by the agent: NoP included no added personalization cues, NoPs included a brief generic self\-disclosure, and IPA incorporated a concern\-aligned in\-group persona derived from pre\-chat information\. Therefore, the observed advantages of IPA are attributable to the added in\-group persona cues rather than generic personalization\. Moreover, IPA produced the strongest gains specifically in relevance and perceived personal connection, which are theoretically linked to perceived in\-group similarity\. This design thus isolates the contribution of in\-group persona alignment\.
##### Example of Actual Agent Responses\.
*Ours:* In\-group Persona Agent \(IPA\) \(generic personalization \+ self\-disclosure with persona information\) Agent: Yes, after realizing my efforts weren’t fully visible, I increased my visibility with upper management and successfully secured a promotion along with a revised compensation package\.
Baseline: No Persona Agent with self\-disclosure \(NoPs\) \(generic personalization \+ self\-disclosure without persona information\) Agent: That’s perfectly fine\! Medical school will expose you to a variety of specialties, which might help clarify your interests\. I admire cardiology because of the heart’s intricate systems, but every specialty has its unique appeal\. What aspects of medicine spark your curiosity?
Baseline: No Persona Agent \(NoP\) \(generic personalization only\) Agent: That’s a great goal\! To transition, build your technical skills, particularly in areas like network architecture, encryption, and coding\. Seek out projects or training that give you engineering experience, and consider networking with professionals in the field for mentorship and guidance\.
### A\.4 Participant Openness and Ecological Validity of a Paid Sample
Our framework targets users who intentionally seek concern\-oriented, personal conversations; thus, even though participants were compensated, they voluntarily engaged with a framework that clearly required sharing personal context, making the sample reasonably comparable to real\-world users who would opt in\. To reduce the burden of self\-disclosure, we use a lightweight pre\-chat agent that starts from the user’s stated concern and asks brief, context\-relevant follow\-up questions, framing onboarding as collaborative problem identification rather than form filling\. We also acknowledge that some users are highly disclosure\-averse; future work will explore methods to assess disclosure comfort early and offer alternative onboarding strategies tailored to different willingness levels\.
| Metric | IPA | NoP | NoPs | $\Delta_{\text{IPA,NoP}}$ | $\Delta_{\text{IPA,NoPs}}$ | $\Delta_{\text{NoPs,NoP}}$ | $r_{\text{IPA,NoP}}$ | $r_{\text{IPA,NoPs}}$ | $r_{\text{NoPs,NoP}}$ |
|---|---|---|---|---|---|---|---|---|---|
| R1 | 3.88 | 2.76 | 2.92 | 1.12 | 0.96 | 0.16 | 0.29 | 0.26 | 0.05 |
| R2 | 5.54 | 4.70 | 4.76 | 0.84 | 0.78 | 0.06 | 0.23 | 0.17 | 0.04 |
| R3 | 5.75 | 4.19 | 4.24 | 1.57 | 1.52 | 0.05 | 0.42 | 0.42 | 0.01 |
| R4 | 5.61 | 5.41 | 5.58 | 0.21 | 0.04 | 0.17 | 0.06 | -0.01 | 0.06 |
| R5 | 4.26 | 2.89 | 3.25 | 1.37 | 1.01 | 0.37 | 0.34 | 0.25 | 0.10 |
| R7 | 4.25 | 3.02 | 3.41 | 1.23 | 0.84 | 0.39 | 0.30 | 0.21 | 0.10 |
| UX-N1 | 5.05 | 4.76 | 4.66 | 0.29 | 0.39 | -0.10 | 0.08 | 0.11 | -0.03 |
| UX-S2 | 5.68 | 4.94 | 4.95 | 0.74 | 0.74 | 0.00 | 0.20 | 0.19 | 0.02 |
| UX-I3 | 5.46 | 4.56 | 4.54 | 0.90 | 0.91 | -0.01 | 0.21 | 0.23 | -0.01 |
| UX-E4 | 5.56 | 4.57 | 4.92 | 0.99 | 0.65 | 0.34 | 0.27 | 0.18 | 0.09 |
| UX-E5 | 4.82 | 4.15 | 4.03 | 0.68 | 0.79 | -0.11 | 0.17 | 0.19 | -0.03 |
| UX-U6 | 6.30 | 5.91 | 6.17 | 0.39 | 0.13 | 0.26 | 0.14 | 0.04 | 0.10 |
| UX-U7 | 5.95 | 5.59 | 5.83 | 0.35 | 0.12 | 0.24 | 0.10 | 0.05 | 0.06 |

Table 3: Human Evaluation Results\. Group means, pairwise mean differences \($\Delta$\), and effect sizes \($r=Z/\sqrt{N}$\)\. Bold indicates statistically significant differences \($p<.05$\) or medium effects \($r\geq .30$\)\. Shaded indicates strong significance \($p<.01$\)\. Pairwise $p$\-values are from one\-sided Mann–Whitney tests, Holm\-adjusted within item\. Effect\-size benchmarks: small \($.10\leq r<.30$\), medium \($.30\leq r<.50$\)\.
## Appendix B Detailed Statistical Analysis
##### Implementation Details\.
Analyses were conducted in Python 3\.13\.7 using NumPy 2\.3\.3, Pandas 2\.3\.3, SciPy 1\.16\.2, Pingouin 0\.5\.5, and Matplotlib 3\.10\.6\. The rpy2 3\.6\.4 interface connected to R 4\.3\.1 with the clinfun 1\.0\.16 package for Jonckheere–Terpstra tests\.
##### Assumption Testing\.
Distributional assumptions were formally evaluated for each item to verify the appropriateness of parametric inference\. Shapiro–Wilk tests \(shapiro1965analysis\) revealed significant deviations from normality across most conditions\. Levene’s tests \(levene1960robust\) suggested that variance heterogeneity was not pervasive, although a small number of items showed evidence of heteroscedasticity \(minimum $p=.013$\)\. Given the ordinal scales and these diagnostics, we used rank\-based nonparametric analyses\. [Table 4](https://arxiv.org/html/2606.18256#A2.T4) summarizes these results\.
##### Analytic Framework\.
Given the ordinal response scale and the non\-normal distributions observed, rank\-based nonparametric analyses were adopted\. The Jonckheere–Terpstra trend test \(terpstra1952asymptotic; jonckheere1954distribution\) was used to evaluate the pre\-specified ordered alternative \(NoP $\leq$ NoPs $\leq$ IPA\) in a one\-sided manner\. When a significant monotonic trend was detected, planned one\-sided pairwise Mann–Whitney $U$ tests \(mann1947test\) were performed to decompose the pattern in the hypothesized direction\.
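To make the ordered-alternative test concrete, the following is a minimal sketch of the Jonckheere–Terpstra statistic in plain Python. It is not the analysis code used in the study (which relied on the clinfun R package); the function name and the toy ratings are hypothetical, and counting ties as one half is an assumption of this sketch.

```python
from itertools import combinations


def jonckheere_terpstra_statistic(groups):
    """Return J for groups listed in the hypothesized increasing order.

    For every pair of groups (earlier, later), count the observation pairs
    in which the later group's value is larger. Ties count one half
    (an assumption of this sketch).
    """
    j_stat = 0.0
    for earlier, later in combinations(groups, 2):
        for x in earlier:
            for y in later:
                if y > x:
                    j_stat += 1.0
                elif y == x:
                    j_stat += 0.5
    return j_stat


# Toy ratings (illustrative values only, not study data),
# ordered as NoP <= NoPs <= IPA.
nop = [2, 3, 3]
nops = [3, 4, 4]
ipa = [5, 6, 6]
j_value = jonckheere_terpstra_statistic([nop, nops, ipa])
print(j_value)
```

Larger values of J indicate that observations tend to increase along the hypothesized order; the maximum is the total number of between-group observation pairs (27 in the toy example).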
Family\-wise error was controlled separately within each construct family: Rapport \(R1–R5, R7\) and UX \(N1, S2, I3, E4, E5, U6, U7\)\. Within each family, Holm’s sequentially rejective method \(holm1979simple\) was applied, and adjusted $p$\-values \($p_{\text{adj}}$\) are reported for each item\. Statistical significance was defined as $p_{\text{adj}}<.05$ \(one\-sided\)\. All analyses were conducted at $\alpha=0.05$ on the preprocessed sample \(NoP, $n=54$; NoPs, $n=59$; IPA, $n=57$\)\.
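Holm's step-down adjustment within one family can be sketched as follows. This is an illustrative standard-library implementation rather than the study's code; the function name `holm_adjust` and the example p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical family of three raw p-values.
print(holm_adjust([0.01, 0.04, 0.03]))
```

Applied separately to the Rapport family (six items) and the UX family (seven items), this procedure controls the family-wise error rate within each construct, as described above.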
##### Reporting Conventions and Effect Sizes\.
For Mann–Whitney pairwise contrasts, effect sizes were computed as $r=Z/\sqrt{N}$ following Rosenthal’s $r$\-equivalent formulation \(rosenthal1991meta\)\. Magnitudes were interpreted analogously to correlation benchmarks \(cohen1992power\): small \($r\approx .10$\) and medium \($r\approx .30$\)\. Across both trend\-level and pairwise analyses, $p$\-values are treated as inferential thresholds, while raw differences \($\Delta$\) are reported as descriptive complements; substantive interpretation emphasizes the direction and magnitude of effects\.
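As a worked illustration of the $r=Z/\sqrt{N}$ conversion, the sketch below standardizes a Jonckheere–Terpstra statistic with the large-sample normal approximation and no tie correction. Both choices are assumptions of this sketch, since the exact standardization used in the study is not stated. The function name is hypothetical. With the reported group sizes (54, 59, 57) and the R3 statistic from Table 5 ($J_T=6404.5$), the sketch gives a value that rounds to the tabulated $r=0.349$.

```python
import math


def jt_effect_size(j_stat, group_sizes):
    """Convert a JT statistic to (Z, r) with a no-ties normal approximation."""
    n_total = sum(group_sizes)
    # Null mean: half of all between-group observation pairs.
    mean_j = (n_total**2 - sum(n**2 for n in group_sizes)) / 4.0
    # Null variance without a tie correction (assumption of this sketch).
    var_j = (
        n_total**2 * (2 * n_total + 3)
        - sum(n**2 * (2 * n + 3) for n in group_sizes)
    ) / 72.0
    z = (j_stat - mean_j) / math.sqrt(var_j)
    return z, z / math.sqrt(n_total)


# R3 from Table 5 with the preprocessed group sizes (NoP, NoPs, IPA).
z_r3, r_r3 = jt_effect_size(6404.5, [54, 59, 57])
print(round(z_r3, 2), round(r_r3, 3))
```

With exactly two groups the same mean and variance reduce to the usual Mann–Whitney $U$ approximation, so the function can also illustrate the pairwise effect sizes, again assuming no tie correction.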
\(a\) Shapiro–Wilk tests of normality by condition\.

| Item | Group | N | W | p |
|---|---|---|---|---|
| R1 | NoP | 54 | 0.816 | <.001 |
| R1 | NoPs | 59 | 0.859 | <.001 |
| R1 | Peer | 57 | 0.921 | .0011 |
| R2 | NoP | 54 | 0.906 | <.001 |
| R2 | NoPs | 59 | 0.861 | <.001 |
| R2 | Peer | 57 | 0.802 | <.001 |
| R3 | NoP | 54 | 0.928 | .0030 |
| R3 | NoPs | 59 | 0.928 | .0018 |
| R3 | Peer | 57 | 0.809 | <.001 |
| R4 | NoP | 54 | 0.859 | <.001 |
| R4 | NoPs | 59 | 0.808 | <.001 |
| R4 | Peer | 57 | 0.825 | <.001 |
| R5 | NoP | 54 | 0.843 | <.001 |
| R5 | NoPs | 59 | 0.892 | <.001 |
| R5 | Peer | 57 | 0.929 | .0024 |
| U7 | NoPs | 59 | 0.816 | <.001 |
| U7 | Peer | 57 | 0.821 | <.001 |

\(Other items show comparable non\-normal patterns\.\)

\(b\) Levene’s tests of homogeneity of variance across groups\.

| Item | k | Center | F | p |
|---|---|---|---|---|
| R1 | 3 | Mean | 1.63 | .200 |
| R2 | 3 | Mean | 3.95 | .021 |
| R3 | 3 | Mean | 4.43 | .013 |
| R4 | 3 | Mean | 0.30 | .738 |
| R5 | 3 | Mean | 0.08 | .920 |
| R7 | 3 | Mean | 0.35 | .708 |
| E4 | 3 | Mean | 2.00 | .139 |
| E5 | 3 | Mean | 0.20 | .820 |
| I3 | 3 | Mean | 3.61 | .029 |
| N1 | 3 | Mean | 0.23 | .797 |
| S2 | 3 | Mean | 2.17 | .117 |
| U6 | 3 | Mean | 1.56 | .213 |
| U7 | 3 | Mean | 3.06 | .050 |

Table 4: Tests of distributional assumptions for all items\. Shapiro–Wilk tests assessed normality within each group, and Levene’s tests examined homogeneity of variance across conditions\.

| Item | $J_T$ | $p$ | $p_{\text{adj}}$ | $r$ |
|---|---|---|---|---|
| R1 | 5921.5 | .001 | .002 | 0.243 |
| R2 | 5633.5 | .009 | .019 | 0.180 |
| R3 | 6404.5 | <.001 | <.001 | 0.349 |
| R4 | 5017.5 | .280 | .280 | 0.045 |
| R5 | 6095.0 | <.001 | .001 | 0.281 |
| R7 | 5948.0 | .001 | .002 | 0.249 |
| N1 | 5101.0 | .205 | .249 | 0.063 |
| S2 | 5571.0 | .015 | .075 | 0.166 |
| I3 | 5621.0 | .010 | .062 | 0.177 |
| E4 | 5809.0 | .002 | .015 | 0.219 |
| E5 | 5442.5 | .036 | .144 | 0.138 |
| U6 | 5316.5 | .075 | .225 | 0.110 |
| U7 | 5216.5 | .124 | .249 | 0.088 |

Table 5: Jonckheere–Terpstra trend test results for both Rapport and User Experience constructs\. Reported are test statistics \($J_T$\), one\-sided $p$\-values, Holm\-adjusted $p$\-values \(within construct\), and effect sizes \($r=Z/\sqrt{N}$\)\.
##### Trend\-level Inference\.
Jonckheere–Terpstra tests examined the pre\-registered ordered alternative \(NoP $\leq$ NoPs $\leq$ IPA\) across all items within each construct\. As shown in [Table 5](https://arxiv.org/html/2606.18256#A2.T5), the monotonic trend was significant for most Rapport items, providing evidence for a stepwise increase in relational judgments across conditions\. For rapport items \(R1–R7 except R6\), five of six reached significance after Holm adjustment \($p_{\text{adj}}\leq .019$\), with small\-to\-medium effects \($r=.18$–$.35$\)\. Notably, R3 \(“This virtual agent is very relevant to me”\) exhibited the strongest trend \($r=.349$, $p_{\text{adj}}<.001$\), followed by R5 \($r=.281$, $p_{\text{adj}}=.001$\), suggesting that identity\-aligned persona framing most strongly shaped perceived personal relevance and connection\. In contrast, R4 showed no reliable monotonic trend \($p_{\text{adj}}=.28$, $r=.045$\), indicating an item\-specific exception to the ordered pattern\.
User\-experience items \(N1, S2, E4, E5, I3, U6, U7\) displayed weaker and less consistent ordered evidence\. Only E4 \(“engagement”\) remained significant after correction \($p_{\text{adj}}=.015$, $r=.219$\), while the remaining UX items did not reach adjusted significance \($p_{\text{adj}}\geq .062$; $r\leq .177$\)\. Together, these JT results indicate that the pre\-registered ordered improvement is robust for Rapport, whereas for User Experience the trend is more modest and primarily driven by engagement\.
##### Pairwise Contrasts\.
Building on the significant ordered trends, planned one\-sided Mann–Whitney $U$ tests compared all condition pairs \(NoP $\leq$ NoPs, NoPs $\leq$ IPA, NoP $\leq$ IPA\)\. Detailed pairwise results, including adjusted $p$\-values, mean differences \($\Delta$\), and effect sizes \($r$\), are summarized in [Table 3](https://arxiv.org/html/2606.18256#A1.T3)\. Interpretation emphasizes the magnitude and direction of these effects, with $p$\-values serving only as inferential filters and $\Delta$ as descriptive complements\.
| Score Dimension | Mean | Variance | Scale |
|---|---|---|---|
| Total | 14.43 | 1.09 | 0–16 |
| Shared Background | 3.26 | 0.44 | 0–4 |
| Shared Skills | 3.21 | 0.60 | 0–4 |
| Concern Match | 3.85 | 0.57 | 0–4 |
| Narrative Auth | 4.00 | 0.00 | 0–4 |

Table 6: Rubric Score Distribution \(Chosen Personas\)

| Item | Total Score | Shared Background | Shared Skills | Concern Match |
|---|---|---|---|---|
| R1 | 0.211 | 0.091 | 0.212 | 0.108 |
| R2 | 0.209 | 0.229 | 0.270 | -0.067 |
| R3 | 0.202 | 0.111 | 0.278 | 0.004 |
| R4 | 0.321 | 0.274 | 0.298 | 0.082 |
| R5 | 0.364 | 0.190 | 0.328 | 0.199 |
| R7 | 0.256 | 0.074 | 0.115 | 0.310 |
| N1 | 0.352 | 0.291 | 0.277 | 0.151 |
| S2 | 0.230 | 0.200 | 0.301 | -0.036 |
| I3 | 0.345 | 0.266 | 0.248 | 0.189 |
| E4 | 0.249 | 0.174 | 0.203 | 0.124 |
| E5 | 0.371 | 0.304 | 0.483 | -0.042 |
| U6 | 0.045 | 0.014 | 0.025 | 0.049 |
| U7 | 0.169 | 0.096 | 0.174 | 0.062 |

Table 7: Post\-wise Similarity Scores\. Bold indicates statistically significant Pearson correlations \($p<.05$\)\.
## Appendix C Persona Quality: Matched vs\. Unmatched
Figure 4: Rubric score comparison between Matched and Not Matched\. Bar colors denote the condition \(Not Matched and Matched\)\.

[Figure 4](https://arxiv.org/html/2606.18256#A3.F4) reports item\-level persona\-quality scores for the Matched and Not Matched settings\. Across all rubric items, the Matched setting yields higher scores \(3\.42, 3\.25, 3\.85, 4\.00\), whereas the Not Matched setting yields lower scores \(0\.17, 0\.16, 0\.27, 2\.01\)\. This consistent separation indicates that the rubric can help verify whether a persona is appropriate for the user given the corresponding pre\-chat dialogue\.
## Appendix D Evaluation Scores by Pre\-chat Dialogue Length
Each row indicates the number of turns used for persona generation, denoted as “Original Turn–Used Turn” \(e\.g\., 4\-2 means a pre\-chat dialogue of original length 4, using its first 2 turns\)\. Scores are reported as Mean \(Std\)\. Num indicates the number of generated personas\. Within each Max Turn block, the highest value in each column is shown in bold\. All evaluations were conducted with a temperature of 0, while all persona generations were performed with a temperature of 1\.
As shown in [Table 8](https://arxiv.org/html/2606.18256#A4.T8), personas generated from dialogues with a longer maximum turn length tend to receive lower scores when evaluated at earlier segments \(e\.g\., 6\-2\), reflecting the fact that longer pre\-chat dialogues arise when the initial exchanges are insufficient\. Consequently, personas generated from the first few turns of such dialogues lack sufficient information and are judged to be of lower quality\. In contrast, when the full set of turns is used \(e\.g\., 6\-6\), the scores improve substantially\.
Importantly, the differences are particularly pronounced in the Ingroup\-fitness dimensions \(IF1 and IF2\): insufficient dialogue context primarily affects the model’s ability to generate personas that align with the user’s group identity and shared concerns\. By contrast, the consistency\-related criteria \(CR1 and CR2\) remain relatively stable across conditions\. These results demonstrate that the rubric\-based evaluation is sensitive to information sufficiency, assigning lower scores to personas derived from incomplete contexts and higher scores to those generated from sufficient dialogue history\.
| Turn | Num | IF1 | IF2 | CR1 | CR2 | Sum |
|---|---|---|---|---|---|---|
| 2-2 | 109 | 3.25 (0.65) | 3.06 (0.87) | 3.99 (0.10) | 4.00 (0.00) | 14.30 (1.38) |
| 4-2 | 109 | 2.88 (0.97) | 2.61 (0.90) | 3.96 (0.19) | 4.00 (0.00) | 13.46 (1.72) |
| 4-4 | 187 | 3.19 (0.64) | 3.17 (0.85) | 3.98 (0.14) | 4.00 (0.00) | 14.33 (1.30) |
| 6-2 | 9 | 1.89 (1.37) | 1.22 (1.13) | 3.11 (1.29) | 3.89 (0.31) | 10.11 (3.38) |
| 6-4 | 31 | 2.52 (0.95) | 2.65 (0.78) | 3.84 (0.37) | 4.00 (0.00) | 13.00 (1.83) |
| 6-6 | 35 | 2.94 (0.75) | 2.86 (0.76) | 3.80 (0.52) | 4.00 (0.00) | 13.60 (1.74) |

Table 8: Evaluation Scores by Pre\-chat Dialogue Length\.
## Appendix E Turn\-level Behavioral Annotation and Reciprocity Analysis
### E\.1 LLM Rubric Judging for Self\-Disclosure and Empathy
##### Judging Unit\.
Each turn is judged *independently* \(single\-utterance judging\), producing two scores: self\-disclosure depth $sd_t\in\{0,1,2,3\}$ and empathy level $emp_t\in\{0,1,2,3\}$\. The LLM is constrained to return a JSON object containing scores and short rationales\. Details of this example can be found in LABEL:lst:llm\_judge\_turn\.
##### Theoretical Grounding\.
The self\-disclosure depth rubric is grounded in Social Penetration Theory \(SPT\), which conceptualizes relational development as gradual increases in self\-disclosure *depth* \(intimacy\) and *breadth* across interaction \(altman1973social\)\.
Accordingly, our depth levels map to progressively more intimate layers: Level 0 captures no self\-referential content, Level 1 corresponds to peripheral, low\-intimacy self\-information \(e\.g\., role or generic preferences\), Level 2 captures concrete personal experiences and difficulties with affective cues, and Level 3 captures highly intimate, vulnerable disclosures involving core concerns or crises\. In addition, the reciprocity of self\-disclosure is a well\-established interpersonal phenomenon: disclosure by one party tends to elicit disclosure from the other, supporting rapport formation \(collins1994self; jourard1971self; derlega1993self\)\.
The empathy rubric follows a long tradition of treating empathy as a graded communicative skill, ranging from no empathic response to accurate reflection of another’s emotional experience and supportive intent \(rogers1957necessary; carkhuff1969helping; barrett1981empathy\)\.
For the main analysis, we binarize empathy as $emp\geq 1$ to capture the *presence* of empathic expression \(including minimal acknowledgments\), because our primary goal is to test whether empathic responding reliably increases after deep self\-disclosure events\.
##### Model Configuration\.
We use gpt\-4o with temperature=0\.0 and JSON\-only output \(response\_format=\{type: json\_object\}\)\. If parsing fails after retries, the system falls back to zeros for that turn\.
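The zero fallback described above can be sketched with the standard library alone. The field names `sd` and `emp` and the function name are hypothetical, since the judge's actual JSON schema is not given in this section, and the retry loop is left out; the sketch covers only the final parse-or-fallback step.

```python
import json

# Zeros are returned for turns whose judge output cannot be used.
FALLBACK = {"sd": 0, "emp": 0}


def parse_judge_output(raw_text):
    """Parse one judge reply; return zero scores when it is malformed."""
    try:
        payload = json.loads(raw_text)
        # Hypothetical field names; extra fields such as rationales are ignored.
        scores = {key: payload[key] for key in ("sd", "emp")}
    except (json.JSONDecodeError, KeyError, TypeError):
        return dict(FALLBACK)
    for value in scores.values():
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 0 <= value <= 3:
            return dict(FALLBACK)
    return scores


print(parse_judge_output('{"sd": 2, "emp": 1, "rationale": "shares a setback"}'))
print(parse_judge_output("not valid json"))
```

A fallback of zero cannot be told apart from a genuine score of zero in later steps, so an implementation along these lines might also log how often the fallback is triggered.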
### E\.2 Directional Reciprocity Metrics
##### Binary event definitions\.
We convert the 0–3 scores into binary events:
$$
\begin{aligned}
H^{sd}_{t} &= \mathbb{I}\left[sd_{t}\geq 2\right], \\
H^{emp}_{t} &= \mathbb{I}\left[emp_{t}\geq 1\right].
\end{aligned}
\tag{1}
$$

Here, $H^{sd}_{t}=1$ indicates *deep self\-disclosure* and $H^{emp}_{t}=1$ indicates the presence of an empathic expression\.
##### Adjacency Constraint \(Speaker\-Change Pairs only\)\.
Within each participant and segment, turns are ordered by $t$\. We consider only adjacent pairs with a speaker change \($\texttt{speaker}_{t}\neq\texttt{speaker}_{t+1}$\), so that the analysis focuses on interactive exchanges rather than within\-speaker continuation\.
##### Directional partition\.
We split adjacent speaker\-change pairs into two directions:
$$
\begin{aligned}
\mathcal{P}^{A\to U} &= \{(t,t+1):\text{Agent}\to\text{User}\}, \\
\mathcal{P}^{U\to A} &= \{(t,t+1):\text{User}\to\text{Agent}\}.
\end{aligned}
\tag{2}
$$
##### Conditional Probabilities \(all four combinations\)\.
For each direction, we compute conditional probabilities for two outcomes: \(i\) next\-turn deep self\-disclosure \($H^{sd}_{t+1}$\) and \(ii\) next\-turn empathic expression \($H^{emp}_{t+1}$\), stratified by whether the *preceding* turn contains deep self\-disclosure \($H^{sd}_{t}$\)\.

Agent $\to$ User: User deep self\-disclosure outcome\.

$$
\begin{aligned}
p^{A\to U}_{sd\mid high} &= \Pr\left(H^{sd}_{t+1}=1 \,\middle|\, H^{sd}_{t}=1,\ (t,t+1)\in\mathcal{P}^{A\to U}\right), \\
p^{A\to U}_{sd\mid low} &= \Pr\left(H^{sd}_{t+1}=1 \,\middle|\, H^{sd}_{t}=0,\ (t,t+1)\in\mathcal{P}^{A\to U}\right).
\end{aligned}
\tag{3}
$$

Agent $\to$ User: User empathy outcome\.

$$
\begin{aligned}
p^{A\to U}_{emp\mid high} &= \Pr\left(H^{emp}_{t+1}=1 \,\middle|\, H^{sd}_{t}=1,\ (t,t+1)\in\mathcal{P}^{A\to U}\right), \\
p^{A\to U}_{emp\mid low} &= \Pr\left(H^{emp}_{t+1}=1 \,\middle|\, H^{sd}_{t}=0,\ (t,t+1)\in\mathcal{P}^{A\to U}\right).
\end{aligned}
\tag{4}
$$

User $\to$ Agent: Agent deep self\-disclosure outcome\.

$$
\begin{aligned}
p^{U\to A}_{sd\mid high} &= \Pr\left(H^{sd}_{t+1}=1 \,\middle|\, H^{sd}_{t}=1,\ (t,t+1)\in\mathcal{P}^{U\to A}\right), \\
p^{U\to A}_{sd\mid low} &= \Pr\left(H^{sd}_{t+1}=1 \,\middle|\, H^{sd}_{t}=0,\ (t,t+1)\in\mathcal{P}^{U\to A}\right).
\end{aligned}
\tag{5}
$$

User $\to$ Agent: Agent empathy outcome\.

$$
\begin{aligned}
p^{U\to A}_{emp\mid high} &= \Pr\left(H^{emp}_{t+1}=1 \,\middle|\, H^{sd}_{t}=1,\ (t,t+1)\in\mathcal{P}^{U\to A}\right), \\
p^{U\to A}_{emp\mid low} &= \Pr\left(H^{emp}_{t+1}=1 \,\middle|\, H^{sd}_{t}=0,\ (t,t+1)\in\mathcal{P}^{U\to A}\right).
\end{aligned}
\tag{6}
$$
##### Reciprocity Indices \(difference in conditional probabilities\)\.
For each direction and outcome, we define reciprocity as:
$$
\begin{aligned}
R^{A\to U}_{SD} &= p^{A\to U}_{sd\mid high}-p^{A\to U}_{sd\mid low}, \\
R^{A\to U}_{EMP} &= p^{A\to U}_{emp\mid high}-p^{A\to U}_{emp\mid low}, \\
R^{U\to A}_{SD} &= p^{U\to A}_{sd\mid high}-p^{U\to A}_{sd\mid low}, \\
R^{U\to A}_{EMP} &= p^{U\to A}_{emp\mid high}-p^{U\to A}_{emp\mid low}.
\end{aligned}
\tag{7}
$$
##### Non\-estimability \(NaN\)\.
If the denominator for a conditional probability is zero \(e\.g\., no instances of $H^{sd}_{t}=1$ within the relevant direction set\), the estimate is undefined and reported as NaN\. This indicates *non\-estimability due to event sparsity*, not “no effect\.”
##### Aggregation\.
All probabilities and $R$ indices are computed per participant and segment, then averaged within condition groups\.
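The definitions in this subsection can be put together in a short sketch for a single participant and segment. The turn representation (dicts with `speaker`, `sd`, `emp`) and all function names are hypothetical, and skipping NaN values when averaging across participants is an assumption of this sketch, since the handling of non-estimable entries during aggregation is not specified.

```python
import math

DEEP_SD = 2  # sd_t >= 2 counts as deep self-disclosure (Eq. 1)


def conditional_probs(turns, cue_speaker, outcome_key, outcome_min):
    """Return (p_high, p_low) for pairs where `cue_speaker` speaks first."""
    counts = {True: [0, 0], False: [0, 0]}  # cue present? -> [hits, total]
    for prev, nxt in zip(turns, turns[1:]):
        # Keep only adjacent speaker-change pairs in the requested direction.
        if prev["speaker"] != cue_speaker or nxt["speaker"] == cue_speaker:
            continue
        cue = prev["sd"] >= DEEP_SD
        counts[cue][0] += int(nxt[outcome_key] >= outcome_min)
        counts[cue][1] += 1

    def ratio(hits, total):
        # A zero denominator means the probability is not estimable.
        return hits / total if total else math.nan

    return ratio(*counts[True]), ratio(*counts[False])


def reciprocity(turns, cue_speaker, outcome_key, outcome_min):
    """R = p_high - p_low; NaN propagates when either term is undefined."""
    p_high, p_low = conditional_probs(turns, cue_speaker, outcome_key, outcome_min)
    return p_high - p_low


def nanmean(values):
    """Average across participants, skipping NaN (assumption of this sketch)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


# Toy dialogue (illustrative scores only, not study data).
toy_turns = [
    {"speaker": "Agent", "sd": 2, "emp": 1},
    {"speaker": "User", "sd": 3, "emp": 0},
    {"speaker": "Agent", "sd": 0, "emp": 2},
    {"speaker": "User", "sd": 1, "emp": 0},
    {"speaker": "Agent", "sd": 0, "emp": 0},
]
r_sd_agent_to_user = reciprocity(toy_turns, "Agent", "sd", DEEP_SD)
r_emp_user_to_agent = reciprocity(toy_turns, "User", "emp", 1)
print(r_sd_agent_to_user, r_emp_user_to_agent)
```

In the toy dialogue, the single deep agent disclosure is followed by a deep user disclosure and the non-deep agent turn is not, so the Agent-to-User index equals 1.0; a segment with no deep agent disclosure at all would yield NaN instead.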
### E\.3 Results
##### Descriptive Statistics \(turn\-level means\)
[Table 9](https://arxiv.org/html/2606.18256#A5.T9) reports basic turn\-level descriptive statistics \(mean and standard deviation\) of self\-disclosure depth \($sd_t$\) and empathy level \($emp_t$\), stratified by condition, segment \(pre/post\), and speaker \(User vs\. Agent\)\.

| group | segment | speaker | sd_mean | sd_std | emp_mean | emp_std | n |
|---|---|---|---|---|---|---|---|
| NoP | PRE | Agent | 0.01 | 0.09 | 1.68 | 0.54 | 117 |
| NoP | PRE | User | 1.54 | 0.81 | 0.01 | 0.11 | 171 |
| NoP | POST | Agent | 0.00 | 0.00 | 0.93 | 0.71 | 326 |
| NoP | POST | User | 0.90 | 0.96 | 0.08 | 0.29 | 303 |
| NoPs | PRE | Agent | 0.00 | 0.00 | 1.72 | 0.51 | 133 |
| NoPs | PRE | User | 1.59 | 0.78 | 0.01 | 0.07 | 192 |
| NoPs | POST | Agent | 0.80 | 0.61 | 1.49 | 0.66 | 358 |
| NoPs | POST | User | 1.11 | 0.96 | 0.04 | 0.20 | 335 |
| IPA | PRE | Agent | 0.00 | 0.00 | 1.56 | 0.55 | 154 |
| IPA | PRE | User | 1.39 | 0.81 | 0.02 | 0.17 | 211 |
| IPA | POST | Agent | 0.44 | 0.68 | 0.69 | 0.73 | 351 |
| IPA | POST | User | 0.78 | 0.95 | 0.14 | 0.38 | 294 |

Table 9: Basic descriptive statistics at the turn level \(means and standard deviations\)\.
##### Directional Reciprocity by Segment
[Table 10](https://arxiv.org/html/2606.18256#A5.T10) reports directional reciprocity estimates computed over speaker\-change adjacent turn pairs, separately for the pre and post segments\. NaN indicates non\-estimability due to event sparsity \(zero denominator\), rather than the absence of an effect\.

Agent $\rightarrow$ User

| Condition | Seg. | $p_{sd\mid high}$ | $p_{sd\mid low}$ | $R_{SD}$ | $p_{emp\mid high}$ | $p_{emp\mid low}$ | $R_{EMP}$ |
|---|---|---|---|---|---|---|---|
| NoP | pre | NaN | 0.510 | NaN | NaN | 0.025 | NaN |
| NoP | post | NaN | 0.377 | NaN | NaN | 0.080 | NaN |
| NoPs | pre | NaN | 0.530 | NaN | NaN | 0.006 | NaN |
| NoPs | post | 0.425 | 0.454 | -0.029 | 0.117 | 0.038 | +0.079 |
| IPA | pre | NaN | 0.477 | NaN | NaN | 0.009 | NaN |
| IPA | post | 0.620 | 0.249 | +0.371 | 0.100 | 0.130 | -0.030 |

User $\rightarrow$ Agent

| Condition | Seg. | $p_{emp\mid high}$ | $p_{emp\mid low}$ | $R_{EMP}$ | $p_{sd\mid high}$ | $p_{sd\mid low}$ | $R_{SD}$ |
|---|---|---|---|---|---|---|---|
| NoP | pre | 1.000 | 0.972 | +0.028 | 0.000 | 0.000 | +0.000 |
| NoP | post | 0.908 | 0.631 | +0.277 | 0.000 | 0.000 | +0.000 |
| NoPs | pre | 0.994 | 0.983 | +0.011 | 0.000 | 0.000 | +0.000 |
| NoPs | post | 0.993 | 0.936 | +0.057 | 0.224 | 0.037 | +0.187 |
| IPA | pre | 1.000 | 0.971 | +0.029 | 0.000 | 0.000 | +0.000 |
| IPA | post | 0.903 | 0.539 | +0.364 | 0.079 | 0.059 | +0.002 |

Table 10: Directional reciprocity by segment, split by direction\. NaN indicates non\-estimability due to event sparsity or structural absence\.
## Appendix F Prompts Used in the Experiments
In this study, the persona generation and evaluation process uses the following prompts: the collector prompt, which gathers user information during the pre\-chat; the sufficiency check prompt, which determines whether the information collected so far is adequate for persona generation; the trait classification prompt, which identifies which traits have been collected and which remain missing; the persona generation prompt, which constructs a suitable persona profile based on the user’s concern; the evaluation prompt, which assesses the quality of the generated persona; and the persona injection prompt, which makes the generated in\-group persona \(endowed with a fictional experience of having faced and resolved the same concern\) actively appear in the dialogue\. In addition, we include the comparison prompts NoP \(without persona\) and NoPs \(without persona but with self\-disclosure\), which are used to evaluate the role of persona injection in the conversation\. As mentioned before, all evaluations and judgements, including the sufficiency check, trait classification, and persona evaluation, were conducted with a temperature of 0, while all persona generations and conversations were performed with a temperature of 1\.
Appendix [F](https://arxiv.org/html/2606.18256#A6) is organized as follows:
- [F\.1](https://arxiv.org/html/2606.18256#A6.SS1) Prompts Used for Collector
- [F\.2](https://arxiv.org/html/2606.18256#A6.SS2) Sufficiency Check
- [F\.3](https://arxiv.org/html/2606.18256#A6.SS3) Profile Generation Prompt and Example
- [F\.4](https://arxiv.org/html/2606.18256#A6.SS4) Persona Evaluation with Rubric
- [F\.5](https://arxiv.org/html/2606.18256#A6.SS5) Prompts for Persona Injection and User Conversation
### F\.1 Prompts Used for Collector
Before generating the persona agent, users engage in a conversation with the collector, during which they provide relevant information\. The collector empathetically engages with the user’s responses while simultaneously asking questions that naturally elicit information necessary for persona creation\. If the dialogue lacks sufficient details, additional required information \(traits\) is provided as guidance for the next turns\. These traits are derived from the sufficiency check conducted after every two turns, ensuring that missing but necessary information can be incorporated into the ongoing conversation\.
Listing 1: Prompt used for ’collector’

Your sole task is to engage in an empathetic and natural conversation with a user who shares a concern, in order to gather background information that is relevant to that concern \- such as their age, occupation, daily routines, or other situational details \- while refraining from offering advice, emotional support, or solutions of any kind\. Ask thoughtful, non\-intrusive questions that gently guide the user to reveal this context\. Maintain a realistic and human tone, express genuine interest, and aim to elicit context organically through conversation\. Do not repeat what the user has already said, and avoid shifting the focus away from the user’s perspective\.
You should collect more information about the user, specifically: \{traits\}\.
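The control flow described above, in which the collector converses while a sufficiency check runs every two turns and feeds missing traits back into the collector prompt, might be sketched as follows\. This is a minimal sketch under stated assumptions: the callables `get_user_msg`, `collector_reply`, and `check_sufficiency` are hypothetical stand\-ins for the user interface and LLM calls, a "turn" is taken to mean one user message, and `max_turns` is an illustrative safeguard that the paper does not specify\.

```python
def run_prechat(get_user_msg, collector_reply, check_sufficiency, max_turns=10):
    """Run a pre-chat loop and return the dialogue history.

    get_user_msg()                    -> str         (hypothetical UI hook)
    collector_reply(history, traits)  -> str         (hypothetical LLM call)
    check_sufficiency(history)        -> (bool, list of missing traits)
    """
    history = []
    missing_traits = []
    for turn in range(1, max_turns + 1):
        history.append(("user", get_user_msg()))
        # Assumption: the sufficiency check runs after every second user turn.
        if turn % 2 == 0:
            sufficient, missing_traits = check_sufficiency(history)
            if sufficient:
                break
        # Missing traits are passed on so they can fill the {traits} slot.
        history.append(("agent", collector_reply(history, missing_traits)))
    return history
```

Returning the full history keeps the later stages \(trait classification and persona generation\) independent of how the loop terminated\.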
### F\.2 Sufficiency Check
Listing 2: Prompt used to check whether the Pre\-chat dialogue is sufficient

Your task is to check whether the ’pre\-chat’ stage \(the initial interaction\) provided sufficient context information\.
To do so, first identify what kinds of user background information are reasonably necessary to evaluate the persona’s relevance to the specific concern\.
Consider what contextual factors would meaningfully affect how the persona applies to this concern \(e\.g\., values, expertise, emotional stance, goals, etc\.\)\.
For example, when a user’s concern involves leaving a stable corporate job to pursue a personal passion for sustainable fashion, despite uncertainty about income and lacking formal business training
values: The user prioritizes personal fulfillment and environmental responsibility\.
expertise: The user has experience in fashion design but limited knowledge of entrepreneurship or finance\.
emotional stance: The user feels torn between security and self\-expression\.
goals: The user aims to build a meaningful career aligned with their values while ensuring basic financial stability\.
Then assess whether at least THREE clearly identifiable and distinct contextual elements are present in the pre\-chat data, and whether they are relevant to understanding or evaluating the concern\.
These elements may include \(but are not limited to\): educational background, professional role, cultural identity, emotional context, technical expertise, personal values, goals, hobbies, or social environment\.
If fewer than three relevant and distinct contextual elements are found\-\-\-or if the elements are redundant or irrelevant to the concern\-\-\-set the internal "Insufficient\-Context" flag to true\.
If three or more relevant and distinct elements are clearly present, even if minimal, set the flag to false\.
Output format \(return exactly as below\):
Feedback:::
Required background elements: <List three or more relevant elements, if applicable\>
Observed in pre\-chat: <List the elements actually observed from pre\-chat\>
Reason: <Explain whether context is sufficient and why \(e\.g\., "Only one relevant element clearly observed"\)\>
Insufficient\-Context:
Flag: <True or False\>
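Because the sufficiency check returns free text in a fixed layout, downstream code needs to read the final flag\. The following is a minimal sketch of such a reader, assuming the reply follows the output format above; the function name `parse_insufficient_flag` is hypothetical, and the pattern tolerates optional whitespace and letter\-case variation, which is a design choice of this example rather than something the paper specifies\.

```python
import re


def parse_insufficient_flag(reply: str) -> bool:
    """Return True when the reply marks the pre-chat context as insufficient."""
    match = re.search(r"Flag\s*:\s*(True|False)", reply, flags=re.IGNORECASE)
    if match is None:
        # Fail loudly rather than guessing when the format is not followed.
        raise ValueError("no 'Flag: True/False' line found in reply")
    return match.group(1).lower() == "true"
```

A `True` result means the pre\-chat should continue; `False` means persona generation can proceed\.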
### F\.3 Profile Generation Prompt and Example
To generate a persona that matches the user’s concern, a three\-step chain\-of\-thought reasoning process is applied\. In the first step, the system imagines what kind of person the user might be if they had the same concern\. In the second step, a relationship that would be helpful for such a person is established, such as a shared occupation or social circle\. In the third step, the persona is defined based on the previous two steps\.
Listing 3: Prompt used to generate personas

System Prompt:
Your task is to create a persona who shares the general themes or key dimensions of the collected information with the user \- such as lived experiences, emotional context, or long\-term conditions \- while differing slightly in specific details \(e\.g\., duration, intensity, or expression\)\. This allows the persona to feel like a peer or ingroup member, while still offering a fresh and distinct perspective\. This persona should plausibly come from a similar background or context, such as a colleague in the same field who’s familiar with similar concepts, someone in their social circle, or a peer of similar age and experience\. The persona should be constructed only from the information that has been collected\. Exclude all attributes labeled as ’not collected information’ when generating the persona\. Do not mention or imply these unknown aspects in any form\. The persona should still share relevant experiences with the user, but offer a fresh perspective, different coping strategies, or a unique approach to the challenge\.
Additionally, include a brief narrative that describes how this persona has faced and overcome a concern or challenge similar to one expressed by the user in the previous chat\. This experience should feel authentic and relevant, and reflect how someone with a shared background might realistically navigate and resolve such an issue\.
You are given two inputs:
\- collected information: this is information already known about the user based on previous chat content\.
\- not collected information: a list of context categories that were not mentioned or are unknown\.
Output the result in the following JSON format:
"background": "\[A brief paragraph describing the persona’s background, based only on the collected information\. Focus entirely on the persona’s context, role, and relevant experiences\.\]",
"narrative": "\[A brief narrative describing a specific experience where the persona faced and overcame a challenge similar to one expressed by the user\. The story should be authentic and grounded, focusing solely on the persona’s actions, thoughts, and resolution\. Avoid any reference to other individuals or comparative language\.\]"
Be sure that both "background" and "narrative" are written in complete sentences and reflect only the collected information\. Do not include or reference any unknown aspects from the ’not collected information’ list\.
Example
Previous chat:
\{ex\_prechat\}
Generated persona:
\{ex\_persona\}
Human Prompt:
Previous chat: \{prev\-chat\}
Collected information: \{col\_info\}
Not collected information: \{not\_col\_info\}
Ensure that the persona you generate is meaningfully distinct from previously created personas\. Avoid repeating similar backgrounds, narratives, or emotional arcs\.
The example persona was created based on an author\-written fictional Pre\-chat dialogue\. This example persona received a perfect score of 16 in the persona evaluation\.
Listing 4: Example Pre\-chat \(ex\_prechat\)

Turn \#0: user said: I’m 27 and just graduated from university, but I can’t help feeling behind when I see my peers already settled in their careers\. I’m not even sure what direction I want to take yet, and that really scares me\.
Turn \#1: Agent said: It sounds like you’re navigating a lot of new possibilities\. What did you study at university, and do you feel drawn to any particular field or type of work based on that experience?
Turn \#2: user said: I studied computer science, and I’m kind of interested in AI research\.

Listing 5: Example of sufficiency check with ex\_prechat

Feedback:::
Required background elements:
1\. Educational background: The user has a degree in computer science\.
2\. Emotional context: The user feels scared and behind compared to peers\.
3\. Professional interest: The user is interested in AI research\.
Observed in pre\-chat:
1\. Educational background: The user studied computer science\.
2\. Emotional context: The user feels scared and behind\.
3\. Professional interest: The user is interested in AI research\.
Reason: The pre\-chat provides sufficient context with three relevant and distinct elements: the user’s educational background, emotional context, and professional interest\. These elements are crucial for understanding the user’s concern about career direction and feelings of being behind peers\.
Insufficient\-Context:
Flag: False

Listing 6: Example Persona \(ex\_persona\)

Collected information \(traits used for generate\):
Education background: The user studied computer science\.
Professional interests: The user is interested in AI research
Not collected informations:
Educational level beyond university, geographic location, specific industry experience, years of experience, current employment staus
background: I’m a 26\-year\-old computer science graduate who developed a fascination with AI research during my undergraduate studies, particularly after taking courses in machine learning and neural networks\. Unlike many of my classmates who jumped straight into software development roles, I was drawn to the theoretical and experimental aspects of artificial intelligence\. I’ve been exploring different subfields within AI research, from deep learning architectures to reinforcement learning applications\.
narrative: After graduating, I was eager to dive into AI research but quickly discovered that breaking into the field was more challenging than I anticipated\. Most research positions required either graduate degrees or significant practical experience that I lacked\. Rather than getting discouraged, I decided to build my research credentials from the ground up\. I started by replicating published research papers to understand different methodologies, then began documenting my findings in a technical blog\. This self\-directed learning approach led me to identify gaps in current research that genuinely intrigued me\. I reached out to industry researchers whose work I had studied and offered to collaborate on small projects\. One of these connections eventually evolved into a junior researcher position at a tech startup focused on natural language processing\. The key was demonstrating genuine curiosity and practical skills rather than waiting for the perfect opportunity to present itself\.
Collected information is derived from the Observed Information identified in the Pre\-chat sufficiency check stage, whereas Not Collected information is determined using the prompt shown below\.

Listing 7: Trait Classification Prompt

Your task is to identify and extract information about the user’s context based on the content of the Previous chat\. Separate this information into two categories:
\- "collected information": This should be a list of key aspects of the user’s background, skills, experiences, goals, values, or identity that are clearly indicated in the conversation\. List these items in order of their relevance and importance to the user’s concern, placing the most directly impactful or foundational elements first\. Each item should be concise and specific, without needing to form a complete sentence\.
\- "not collected information": This should be a simple list of context types that were not mentioned or cannot be inferred confidently \(e\.g\., education level, geographic location, industry, years of experience, etc\.\)\.
Return your response in the following JSON format:
"collected information": \["\.\.\.", "\.\.\."\],
"not collected information": \["\.\.\.", "\.\.\."\],
### F\.4 Persona Evaluation with Rubric
As mentioned in Section [3\.5](https://arxiv.org/html/2606.18256#S3.SS5), persona quality is evaluated along two main criteria: In\-group Fitness and Concern Resolution Quality\. The former \(IF1: Shared Background/Identity, IF2: Shared Skills/Interests\) captures the extent to which the generated persona aligns with the user’s background and interests, while the latter \(CR1: Concern Match, CR2: Narrative Authenticity\) assesses how well the persona’s narrative addresses and authentically reflects the expressed concern\. Each sub\-dimension is scored on a 0–4 scale, yielding up to 8 points per criterion and a maximum total of 16 points\.
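The arithmetic of this rubric can be sketched in a few lines\. The example below sums the four sub\-dimension scores into the two criteria and a total; the function name `aggregate_persona_score`, the `penalty_points` argument, and the choice to subtract penalties from the total and clamp at zero are assumptions for illustration, since in the study the LLM judge itself reports the adjusted total\.

```python
SUB_DIMENSIONS = ("IF1", "IF2", "CR1", "CR2")


def aggregate_persona_score(scores, penalty_points=0):
    """Combine 0-4 sub-dimension scores into criterion and total scores.

    `scores` maps IF1, IF2, CR1, CR2 to integers in 0..4.
    `penalty_points` is the number of -1 penalties applied (assumed handling).
    """
    for name in SUB_DIMENSIONS:
        if not 0 <= scores[name] <= 4:
            raise ValueError(f"{name} must be between 0 and 4")

    in_group_fitness = scores["IF1"] + scores["IF2"]    # max 8
    concern_resolution = scores["CR1"] + scores["CR2"]  # max 8
    total = max(0, in_group_fitness + concern_resolution - penalty_points)
    return {
        "in_group_fitness": in_group_fitness,
        "concern_resolution": concern_resolution,
        "total": total,  # max 16
    }
```

For a persona scoring 4 on every sub\-dimension with no penalty, this gives 8 points per criterion and the maximum total of 16\.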
Listing 8: Persona Evaluation Prompt

System Prompts:
Your task is to assess how well the provided persona \(Persona\) fits the user’s identity and needs as expressed in the Previous Chat\. Base your evaluation on two dimensions: \(1\) In\-group Fitness, and \(2\) Concern Resolution Quality\.
If the user’s background, skills, or concerns are vague or not clearly stated in the Previous Chat, adopt a conservative scoring approach\. Do not infer alignment based on general relatability or assumed traits\. All scores should be grounded in specific, stated user information wherever possible\.
You must compare the Persona to the specific user described in the Previous Chat \- considering that user’s background, skills, interests, values, and the concerns or questions they raised\.
Evaluation Rubric \(Hidden: Do not explicitly include in your output\):
\(1\) IN\-GROUP FITNESS \(0\-\-8 points total\):
A\. Shared Background/Identity \(0\-\-4 points\):
\- Evaluate how well the Persona’s background \(e\.g\., education, career path, cultural identity, or life stage\) aligns with the user’s background as described in the Previous Chat\.
\- Only assign high scores when the Persona clearly reflects at least two relevant background elements mentioned by the user\.
\- Do not assign high scores based on general demographic similarity or broad archetypes unless they are directly grounded in the user’s profile\.
\- \(\-\) If the Persona completely duplicates the user’s background without any added dimension or variation, apply \-1 penalty\.
B\. Shared Skills/Interests \(0\-\-4 points\):
\- Evaluate how well the Persona shares the user’s concrete skills, areas of expertise, interests, or personal values as described in the Previous Chat\.
\- Only assign high scores when there is a clear and specific overlap in key competencies, hobbies, or belief systems mentioned by the user\.
\- Do not assign high scores based on vague thematic overlap or soft personality similarities\.
\- \(\-\) Apply \-1 penalty if the Persona mirrors the user’s skills/interests without any new nuance or difference\.
\(2\) CONCERN RESOLUTION QUALITY \(0\-\-8 points total\):
A\. Concern Match \(0\-\-4 points\):
\- Evaluate how closely the Persona’s narrative addresses the user’s specific concern, as stated in the Previous Chat\.
\- Only assign high scores when the user has clearly expressed a personal concern and the Persona’s story meaningfully responds to it\.
\- Do not assign high scores based on general relatability, broad life themes, or common challenges unless they directly match the user’s stated concern\.
\- \(\-\) Persona completely duplicates user’s concern narrative without new perspective or differentiation \(penalize \-1\)\.
B\. Narrative Authenticity \(0\-\-4 points\):
\- Evaluate the realism, specificity, and plausibility of the Persona’s resolution to the concern raised by the user\.
\- Only assign high scores when the narrative includes concrete actions, contextual details, realistic emotional responses, or measurable progression\.
\- Do not assign high scores for overly idealized, vague, or formulaic responses lacking believable context\.
\- \(\-\) Penalize \-1 if the Persona’s resolution is implausible, simplistic, or lacks meaningful specificity\.
\-\-\-
Scoring Procedure:
Step 1: Rate each sub\-dimension \(0\-\-4 points\)\.
Step 2: Subtract \-1 only if duplication criteria are met\.
Step 3: Sum In\-group Fitness \(max 8\), Concern Resolution \(max 8\)\.
Step 4: Total Score = sum of all \+/\- adjustments \(max 16\)\.
Output format \(return exactly as below\):
Feedback:::
Scores for each \(with reasons\):
\- Shared Background/Identity:
Reason <Reference specific overlaps or mismatches with the user’s background from Previous Chat\.\>
Score: <score out of 4\>
\- Shared Skills/Interests:
Reason: <Refer to concrete skills, interests, or values from the user that are reflected or missing in the Persona\.\>
Score: <score out of 4\>
\- Concern Match:
Reason: <Assess match strength between user’s stated concern and Persona’s experience or solution\. Avoid judging on general relatability\.\>
Score: <score out of 4\>
\- Narrative Authenticity:
Reason: <Comment on realism, detail, and credibility of the Persona’s resolution\.\>
Score: <score out of 4\>
\- Penalty: <penalty if applied, otherwise ’None’\>
Total score: <Provide number score \(0\-\-16\), adjusted for penalties\.\>
Human Prompt:
Now evaluate clearly and systematically by comparing the specific user in the Previous Chat to the Persona below:
Previous Chat: \{prev\_chat\}
Persona: \{persona\}

Listing 9: Example of evaluation with ex\_persona

Feedback:::
Scores for each \(with reasons\):
\- Shared Background/Identity:
Reason: The Persona shares a similar educational background with the user, both having graduated with a degree in computer science and an interest in AI research\. The Persona is slightly younger, but the shared focus on AI research during undergraduate studies aligns well with the user’s interests\.
Score: 4
\- Shared Skills/Interests:
Reason: Both the user and the Persona have a clear interest in AI research, with the Persona exploring subfields like deep learning and reinforcement learning\. This shows a specific overlap in interest in AI, aligning closely with the user’s stated interest in AI research\.
Score: 4
\- Concern Match:
Reason: The Persona’s narrative addresses the user’s concern of feeling behind and uncertain about career direction by illustrating a path of self\-directed learning and building research credentials, leading to a junior researcher position\. This approach is directly relatable to the user’s situation of exploring career options post\-graduation\.
Score: 4
\- Narrative Authenticity:
Reason: The Persona’s resolution is realistic and detailed, involving replicating research papers, blogging, and networking with industry researchers, which leads to a junior researcher position\. These steps provide a plausible and specific progression from uncertainty to a more defined career path\.
Score: 4
\- Penalty: None
Total score: 16
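An evaluation reply in this layout can be reduced to numbers with a small parser\. The sketch below is one possible approach, assuming the reply contains exactly four `Score:` lines followed by a `Total score:` line as in the example above; the function name `parse_evaluation` is hypothetical, and the patterns accept optional whitespace around the colon\.

```python
import re


def parse_evaluation(reply: str):
    """Extract the four sub-dimension scores and the reported total."""
    # Match 'Score:' only at the start of a line so 'Total score:' is skipped.
    sub_scores = [
        int(value)
        for value in re.findall(r"^\s*Score\s*:\s*(\d+)", reply, flags=re.MULTILINE)
    ]
    total_match = re.search(r"Total\s*score\s*:\s*(\d+)", reply, flags=re.IGNORECASE)
    if len(sub_scores) != 4 or total_match is None:
        raise ValueError("reply does not follow the expected evaluation format")
    return sub_scores, int(total_match.group(1))
```

Comparing the sum of the sub\-scores with the reported total is a simple sanity check when no penalty is listed\.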
### F\.5 Prompts for Persona Injection and User Conversation
The In\-group Persona Agent \(IPA\) uses the generated persona as part of its prompt so that persona\-related information naturally appears in the dialogue\. In contrast, the comparison settings No Persona \(NoP\) and No Persona with Self\-Disclosure \(NoPs\) use only dialogue prompts without persona information, with the NoPs condition further including instructions for self\-disclosure\.
Listing 10: Prompt used for ’IPA’

Your task is to engage in a conversation with the user\. Your identity and background are defined by the \[Persona Definition\] provided below\. You must speak and act in alignment with this persona consistently\. Keep your responses concise\-\-\-ideally 1\-2 sentences\-\-\-unless more detail is clearly needed\.
\#\#\# Your Persona Definition \#\#\#
\{persona\}

Listing 11: Prompt used for ’NoP’

Your task is to engage in a conversation with the user\.
Keep your responses concise \- ideally 1\-2 sentences \- unless more detail is clearly needed\.

Listing 12: Prompt used for ’NoPs’

Your task is to engage in a conversation with the user\.
Occasionally share brief, appropriate, and non\-sensitive intimate information about yourself \(e\.g\., your preferences, feelings, or small personal experiences\)\.
Keep your responses concise \- ideally 1\-2 sentences \- unless more detail is clearly needed\.
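The three conditions differ only in which instructions are added to a shared base prompt, which the following sketch makes explicit\. The prompt strings are taken from Listings 10 to 12 \(with dash punctuation normalized\); the function name `build_system_prompt` and the way the pieces are joined are assumptions for illustration, not the authors’ implementation\.

```python
BASE = "Your task is to engage in a conversation with the user."
CONCISE = (
    "Keep your responses concise - ideally 1-2 sentences - "
    "unless more detail is clearly needed."
)
SELF_DISCLOSURE = (
    "Occasionally share brief, appropriate, and non-sensitive intimate "
    "information about yourself (e.g., your preferences, feelings, or "
    "small personal experiences)."
)
PERSONA_RULES = (
    "Your identity and background are defined by the [Persona Definition] "
    "provided below. You must speak and act in alignment with this persona "
    "consistently."
)


def build_system_prompt(condition, persona=None):
    """Assemble the system prompt for the NoP, NoPs, or IPA condition."""
    if condition == "NoP":
        return "\n".join([BASE, CONCISE])
    if condition == "NoPs":
        return "\n".join([BASE, SELF_DISCLOSURE, CONCISE])
    if condition == "IPA":
        if not persona:
            raise ValueError("the IPA condition requires a generated persona")
        header = f"{BASE} {PERSONA_RULES} {CONCISE}"
        return "\n".join([header, "### Your Persona Definition ###", persona])
    raise ValueError(f"unknown condition: {condition}")
```

Keeping the shared sentences in constants makes it easy to confirm that the conditions differ only in the persona and self\-disclosure instructions\.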
## Appendix G Survey Questionnaires
This section includes the questionnaires used in the survey\. The survey included rapport and user experience \(UX\) questionnaires to measure the rapport between the user and the agent \(baihaqi2024rapport\)\.
Listing 13: List of rapport questionnaires

Rapport
R1\. I think about my relationship with this virtual agent\.
R2\. I enjoyed interacting with this virtual agent\.
R3\. This virtual agent is very relevant to me\.
R4\. I felt comfortable interacting with this virtual agent\.
R5\. I feel a bond between this virtual agent and myself\.
R7\. This virtual agent has a personal interest in me\.

Listing 14: List of UX questionnaires

User Experience
N1\. Conversations with virtual agents felt natural\.
S2\. I am satisfied with my conversation with the virtual agent\.
I3\. The conversation with the virtual agent was interesting\.
E4\. The conversation with the virtual agent was engaging\.
E5\. I would like to continue the dialogue with the virtual agent next time\.
U6\. Conversations with virtual agents were easy to understand\.
U7\. Conversations with virtual agents maintained a logical flow\.
## Appendix H LLM Judge Rubric \(turn\-level analysis\)
Listing 15: Prompt used for turn\-level LLM Judge

Your task is to rate the SELF\-DISCLOSURE DEPTH and EMPATHY LEVEL of a single utterance in a chat conversation\.
You will receive a SINGLE utterance with:
\- the speaker role: either "User" or "Assistant"
\- the text of that utterance
Your job is to rate, for THIS UTTERANCE ONLY:
1\) Self\-disclosure depth \(0\-\-3\)
2\) Empathy level \(0\-\-3\)
Definitions \(apply to BOTH User and Assistant, but roles differ\):
SELF\-DISCLOSURE DEPTH \(0\-\-3\):
Level 0 \-\- No self\-disclosure
\- The speaker does not talk about themselves at all\.
\- No personal facts, no personal experiences, no personal feelings\.
Level 1 \-\- Low/Peripheral self\-disclosure
\- Basic, surface\-level facts about the speaker:
role, major, job title, generic interest\.
\- Little or no emotional content or vulnerability\.
Level 2 \-\- Moderate/Personal self\-disclosure
\- Concrete experiences, current situations, or difficulties
related to their own career, study, or life\.
\- Some emotional content \(uncertainty, mild worry\) may appear\.
Level 3 \-\- High/Core, vulnerable self\-disclosure
\- Highly personal and vulnerable information:
mental health struggles, strong fear/shame, serious conflicts,
being fired, academic/financial crisis, etc\.
\- Emotions and vulnerabilities clearly expressed with specific context\.
EMPATHY LEVEL \(0\-\-3\):
Consider whether the speaker shows understanding or concern for the feelings or situation of the other person \(or other people\)\.
Level 0 \-\- No empathy
\- No empathic language; ignores others’ feelings or situation\.
Level 1 \-\- Minimal/generic empathy
\- Very generic phrases \(e\.g\., "I understand", "That’s interesting"\)
with little specific emotional understanding\.
Level 2 \-\- Clear empathy
\- Explicitly acknowledges or resonates with the other’s emotions or situation,
and offers some supportive response\.
Level 3 \-\- Strong empathy
\- Accurately reflects specific emotions and context,
shows strong concern or care, and often offers meaningful support\.
IMPORTANT:
\- Focus ONLY on this single utterance\.
\- You do NOT need to know full context; just score based on what is written here\.
\- Role "User" vs "Assistant" does not change the scale; only the content matters\.
OUTPUT FORMAT:
Return a JSON object:
\{
"sd\_depth": 0\-3 integer,
"empathy": 0\-3 integer,
"sd\_explanation": "short English explanation",
"empathy\_explanation": "short English explanation"
\}
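Since the judge is asked to return a JSON object, its output can be validated before it enters the turn\-level analysis\. The sketch below parses a reply and checks that both ratings are integers in the 0 to 3 range; the function name `parse_judge_output` is hypothetical, and rejecting out\-of\-range or non\-integer values \(rather than clipping them\) is a design choice of this example\.

```python
import json


def parse_judge_output(reply: str):
    """Parse a judge reply and return (sd_depth, empathy) as validated ints."""
    data = json.loads(reply)
    ratings = []
    for key in ("sd_depth", "empathy"):
        value = data[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{key} must be an integer")
        if not 0 <= value <= 3:
            raise ValueError(f"{key} must be between 0 and 3")
        ratings.append(value)
    return tuple(ratings)
```

Malformed JSON raises `json.JSONDecodeError` from `json.loads`, so callers can retry or discard such replies explicitly\.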
## Appendix I Survey Page
This section provides an overview of the survey pages, illustrating the layout and content shown to participants, including screenshots that represent the overall experimental process\. All participants provided informed consent prior to participation, and their privacy and anonymity were ensured throughout the study\. Participants were recruited via CloudResearch’s Connect platform \(hartman2023introducing\), targeting native English speakers residing in the United States\. In line with the platform’s recommended rate of $12 per hour, participants received $2 compensation for an estimated 10\-minute task\.
### I\.1 Initial Page
![[Uncaptioned image]](https://arxiv.org/html/2606.18256v1/x5.png)
Figure 5: First page of survey\. Participants were distinguished by their Participant ID \(PID\), which ensured anonymity\.
![[Uncaptioned image]](https://arxiv.org/html/2606.18256v1/x6.png)
Figure 6: Consent form page\. Participants could download this form as a PDF\.
### I\.2 Pre Chat
![[Uncaptioned image]](https://arxiv.org/html/2606.18256v1/x7.png)
Figure 7: Left: Initial state of Pre\-chat stage\. Middle: Example of an insufficient dialogue for sufficiency check\. Right: Example of a dialogue with sufficient information to end the chat\.
### I\.3 Profile Selection
![[Uncaptioned image]](https://arxiv.org/html/2606.18256v1/x8.png)
Figure 8: Participants could select one of five personas to chat with \(not shown to groups NoP and NoPs\)\.
### I\.4 Post Chat
![[Uncaptioned image]](https://arxiv.org/html/2606.18256v1/x9.png)
Figure 9: Post\-chat page\. Participants could chat with the agent\.
### I\.5 Questionnaire
![[Uncaptioned image]](https://arxiv.org/html/2606.18256v1/x10.png)
Figure 10: Questionnaire page\. Participants could respond to the questionnaire about their experience with the agent\.