PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

arXiv cs.CL Tools

Summary

PersonaKit is an open-source web platform designed for rapid prototyping and user testing of diverse personas in full-duplex dialogue systems. It allows researchers to configure persona-specific turn-taking behaviors via JSON and conduct A/B surveys to evaluate sociolinguistic interactions.

arXiv:2605.06007v1 (announce type: new)

# A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue
Source: [https://arxiv.org/html/2605.06007](https://arxiv.org/html/2605.06007)
Hyunbae Jeon, Department of Computer Science, Emory University, Atlanta, GA, USA, harry.jeon@emory.edu

Jinho D. Choi, Department of Computer Science, Emory University, Atlanta, GA, USA, jinho.choi@emory.edu

###### Abstract

As spoken dialogue systems expand beyond traditional assistant roles to encompass diverse personas (such as authoritative instructors, uncooperative merchants, or distracted workers), they require distinct, human-like turn-taking behaviors to maintain psychological immersion. However, current full-duplex systems often default to a rigid, overly accommodating “always-yield” policy during overlapping speech, which severely undermines character consistency for non-submissive roles. Evaluating alternative, persona-specific turn-taking strategies through empirical user studies is challenging because building real-time full-duplex test environments requires substantial engineering overhead. To address this, we present PersonaKit (PK), an open-source, low-latency web platform for the rapid prototyping and evaluation of conversational agents. Using intuitive JSON configurations, researchers can define personas, specify probabilistic interruption-handling behaviors (e.g., yield, hold, bridge, or override), and automatically deploy comparative A/B surveys. Through an in-the-wild evaluation with 8 distinct personas, we demonstrate that PersonaKit provides an extensible, end-to-end framework for studying complex sociolinguistic behaviors in next-generation spoken agents.


Code: [github.com/HarryJeon24/PersonaStudyKit](https://github.com/HarryJeon24/PersonaStudyKit) · Demo: [persona-studykit.run.app](https://persona-studykit-mncljchxpq-uc.a.run.app/) · Video: [youtu.be/oSrmQtiM4tI](https://youtu.be/oSrmQtiM4tI)

## 1 Introduction

The shift of spoken dialogue systems from half-duplex to full-duplex (Skantze, [2021](https://arxiv.org/html/2605.06007#bib.bib1); Ma et al., [2025](https://arxiv.org/html/2605.06007#bib.bib2)) moves the frontier of immersive-agent design from text quality to *sociolinguistic behavior*: how agents manage overlapping speech. In natural conversation, turn-taking is heavily mediated by interpersonal roles and status (Sacks et al., [1974](https://arxiv.org/html/2605.06007#bib.bib3)). An authoritative military instructor handles an interruption very differently from a subservient virtual assistant (Beňuš, [2011](https://arxiv.org/html/2605.06007#bib.bib4)). Yet most commercial full-duplex systems default to an *always-yield* policy, immediately ceding the floor on user speech and breaking immersion for non-submissive roles. Testing how conversational pragmatics affect user perception requires significant engineering effort across WebRTC audio, Voice Activity Detection (VAD), dynamic LLM prompt injection, and latency tracking. We developed PersonaKit (PK) to eliminate this bottleneck. PK contributes (i) an open-source, plug-and-play web platform for full-duplex persona evaluation, (ii) a JSON-based mechanism that makes persona-conditioned interruption policies first-class configurable objects, and (iii) an end-to-end workflow from live deployment through auto-generated surveys to structured log export.

## 2 Related Work

Full-duplex spoken language models (Ma et al., [2025](https://arxiv.org/html/2605.06007#bib.bib2)) have magnified the need for robust turn-taking modeling (Skantze, [2021](https://arxiv.org/html/2605.06007#bib.bib1)), but classic systematics (Sacks et al., [1974](https://arxiv.org/html/2605.06007#bib.bib3)) remain hard to integrate into LLM-driven agents. Persona modeling, meanwhile, has a long history in dialogue (Zhang et al., [2018](https://arxiv.org/html/2605.06007#bib.bib5)), yet existing evaluations are predominantly text-based, ignoring the acoustic pragmatics of dominance, yielding, and barge-in recovery. PK bridges these two threads by exposing turn-taking strategy itself as a persona parameter and providing the infrastructure to evaluate it in live voice interaction.

[Figure 1 diagram. Client (Web Browser): Mic (WebRTC), Client VAD (volume gate), Halt Playback on barge-in, Cutoff Tracker, Speaker Out. Server (Flask + Socket.IO): ASR, Zero-Shot Intent Classifier, Turn-Taking Manager (Yield / Resume / Bridge / Override), LLM Generation, TTS Synthesis, Tag Parser (`[EXIT]`), Auto A/B Survey. Researcher-Facing Configs: `persona.json`, `interruption_config.json`, `model_config.json`, `session_config.json`, API keys (Secret Mgr).]

Figure 1: PersonaKit architecture. The browser performs client-side VAD and tracks the exact *cutoff text* on barge-in. The Flask server classifies the intent of the user’s interruption, then the Turn-Taking Manager selects a strategy (Yield / Resume / Bridge / Override) by reading `persona.json` and `interruption_config.json`. `model_config.json` routes generation and TTS to the chosen providers; API keys are held in a secrets store. All experimental behavior is configured through JSON; no source code is modified.

## 3 System Architecture

PersonaKit isolates low-latency audio engineering from experimental design: researchers interact with the platform entirely through a web dashboard and four JSON files: `persona.json` (scenario, role, opening prompt), `interruption_config.json` (strategy matrix), `session_config.json` (survey), and `model_config.json` (LLM/TTS routing). The stack is open source (Python/Flask + vanilla JS), so researchers can clone PK, run it locally, or swap in new providers.

### 3.1 Client-Side VAD and Audio Tracking

The frontend uses WebRTC for microphone capture with a client-side VAD node that halts local playback on barge-in. PK tracks byte-level playback to log the *cutoff text* (what the bot vocalized before being interrupted) and the *remaining text* it still intended to say; both are returned to the server so the LLM knows precisely where it was interrupted.
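The cutoff bookkeeping described above can be illustrated with a minimal Python sketch. It is not PK's implementation: the function name `split_at_cutoff` and the assumption that audio bytes map linearly onto characters (a uniform speech rate) are ours, chosen only to show how a playback position could be turned into cutoff and remaining text.

```python
def split_at_cutoff(full_text: str, bytes_played: int, total_bytes: int) -> tuple[str, str]:
    """Estimate (cutoff_text, remaining_text) from playback progress.

    Hypothetical helper: assumes characters map linearly onto audio bytes.
    """
    if total_bytes <= 0:
        # Nothing was synthesized or played, so everything is still pending.
        return "", full_text
    fraction = min(max(bytes_played / total_bytes, 0.0), 1.0)
    char_index = round(fraction * len(full_text))
    # Snap back to the previous whitespace so no word is split in half.
    while 0 < char_index < len(full_text) and not full_text[char_index].isspace():
        char_index -= 1
    return full_text[:char_index].rstrip(), full_text[char_index:].lstrip()


cutoff, remaining = split_at_cutoff("Repeat it again!", bytes_played=10, total_bytes=16)
print(repr(cutoff), repr(remaining))
```

A production tracker would more likely use word-level timestamps from the TTS provider, where available, instead of a linear estimate.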

### 3.2 Turn-Taking as a Persona Tool

When interrupted, the server transcribes the user’s utterance and classifies its intent into four categories grounded in prior phonetic and conversation-analytic work: *Competitive* (seeks the floor to contradict or override), *Cooperative* (adds information without derailing), *Topic Change* (pivots subject), and *Backchannel* (short affirmations, not floor-taking). Researchers define a strategy matrix in `interruption_config.json` mapping these intents to four actions (*Yield*, *Resume/Hold*, *Bridge*, *Override*) with probability weights. For example, a dominant persona might weight Competitive interruptions as 50% Resume, 25% Override, 15% Bridge, 10% Yield, while a cooperative persona inverts these.
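To make the strategy matrix concrete, here is a hedged sketch of how such a matrix might look once loaded into Python, together with a sanity check that each row is a valid probability distribution. The key names and the `validate_matrix` helper are assumptions for illustration; the actual schema of `interruption_config.json` should be taken from the repository. Only the Competitive row uses weights quoted in the text.

```python
import math

INTENTS = {"competitive", "cooperative", "topic_change", "backchannel"}
ACTIONS = {"yield", "resume", "bridge", "override"}

# Hypothetical in-memory form of a dominant persona's strategy matrix.
# The "competitive" row uses the weights quoted in the text; the
# "backchannel" row is an invented placeholder.
dominant_matrix = {
    "competitive": {"resume": 0.50, "override": 0.25, "bridge": 0.15, "yield": 0.10},
    "backchannel": {"resume": 1.00},
}


def validate_matrix(matrix):
    """Raise ValueError unless every row is a distribution over known actions."""
    for intent, weights in matrix.items():
        if intent not in INTENTS:
            raise ValueError(f"unknown intent: {intent}")
        unknown = set(weights) - ACTIONS
        if unknown:
            raise ValueError(f"unknown actions for {intent}: {sorted(unknown)}")
        if any(w < 0 for w in weights.values()):
            raise ValueError(f"negative weight for {intent}")
        if not math.isclose(sum(weights.values()), 1.0):
            raise ValueError(f"weights for {intent} do not sum to 1")


validate_matrix(dominant_matrix)  # passes silently for a well-formed matrix
```

Validating at load time catches a mistyped weight before it silently skews a study condition.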

#### How probabilities drive generation\.

The matrix is applied *before* generation. On each interruption, the Turn-Taking Manager reads the intent, samples an action from its categorical distribution, and injects that action as a control token into the LLM’s system prompt (e.g., “`[STRATEGY=RESUME]`: finish your previous sentence, ignoring the user”), so the LLM generates *conditioned on* a pre-committed action. The Autonomous condition (Style C) skips sampling and lets the LLM choose its own strategy zero-shot.
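The sample-then-inject step can be sketched as follows. This is an illustrative approximation under stated assumptions, not the repository's code: `sample_action`, `build_system_prompt`, and all hint strings except the RESUME wording (quoted in the text) are hypothetical.

```python
import random

# Hint text per action. Only the RESUME wording is quoted from the text;
# the other three are invented placeholders.
STRATEGY_HINTS = {
    "YIELD": "stop speaking and respond to the user",
    "RESUME": "finish your previous sentence, ignoring the user",
    "BRIDGE": "briefly acknowledge the user, then continue your point",
    "OVERRIDE": "talk over the user and assert your own point",
}


def sample_action(weights, rng):
    """Draw one action from a categorical distribution given as {action: weight}."""
    actions = list(weights)
    return rng.choices(actions, weights=[weights[a] for a in actions], k=1)[0]


def build_system_prompt(persona_prompt, action):
    """Append the pre-committed strategy as a control token to the persona prompt."""
    tag = action.upper()
    return f"{persona_prompt}\n[STRATEGY={tag}]: {STRATEGY_HINTS[tag]}"


rng = random.Random(0)  # seeded only so the sketch is reproducible
competitive_row = {"resume": 0.50, "override": 0.25, "bridge": 0.15, "yield": 0.10}
action = sample_action(competitive_row, rng)
print(build_system_prompt("You are a drill sergeant.", action))
```

Because the action is fixed before the LLM call, the logged strategy always matches what the model was asked to do, which keeps the event log usable for later analysis.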

### 3.3 Automated Lifecycle and Data Export

Sessions end on a `MAX_TURNS` cap or a verbal `TERMINATE` intent; the LLM emits an in-character farewell with a hidden `[EXIT]` tag that triggers the post-session survey. Each session exports the dialogue transcript, an event log with per-turn intent, strategy, and cutoff/remaining text, and all survey responses as JSON or CSV.
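A minimal sketch of the two mechanics in this paragraph, stripping the hidden `[EXIT]` tag from a reply and flattening an event log to CSV, is given below. The function names and the event-log column names are assumptions made for illustration; the exported files in the repository define the real schema.

```python
import csv
import io
import re

EXIT_TAG = re.compile(r"\s*\[EXIT\]\s*")


def parse_exit(reply):
    """Return (text_to_speak, should_end_session) for one LLM reply."""
    should_exit = "[EXIT]" in reply
    spoken = EXIT_TAG.sub(" ", reply).strip()
    return spoken, should_exit


def events_to_csv(events):
    """Serialize per-turn events to CSV text (column names are hypothetical)."""
    fields = ["turn", "intent", "strategy", "cutoff_text", "remaining_text"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


spoken, done = parse_exit("Get out of my tavern. [EXIT]")
print(spoken, done)
print(events_to_csv([{
    "turn": 1,
    "intent": "COMPETITIVE",
    "strategy": "RESUME",
    "cutoff_text": "Repeat it",
    "remaining_text": "again!",
}]))
```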

Table 1: Persona catalog mapped to the Interpersonal Circumplex (Wiggins, [1979](https://arxiv.org/html/2605.06007#bib.bib7)).

## 4 Pilot User Study

#### Study Design\.

Five participants ($N=5$) completed the full study, engaging with 8 occupational personas balanced across the Interpersonal Circumplex (Table [1](https://arxiv.org/html/2605.06007#S3.T1)), yielding 120 dialogue sessions (5 participants × 8 personas × 3 styles). For each persona, users experienced three randomized within-subject conditions: Style A (Always-Yield baseline), Style B (Probabilistic, JSON-tuned strategy weights), and Style C (Autonomous, where the LLM selects a strategy zero-shot from the persona prompt). Condition order was randomized per persona, and the underlying LLM and voice were held fixed across styles.
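The design arithmetic and the per-persona randomization can be sketched as below. The persona names are placeholders (Table 1's contents are not reproduced here), `condition_orders` is a hypothetical helper, and the sketch draws one order per persona for a single participant; how PK seeds or stores the order is not specified in the text.

```python
import random

STYLES = ("A", "B", "C")  # Always-Yield, Probabilistic, Autonomous


def condition_orders(personas, seed=None):
    """Draw an independent random ordering of the three styles for each persona."""
    rng = random.Random(seed)
    return {persona: rng.sample(STYLES, k=len(STYLES)) for persona in personas}


participants = 5
personas = [f"persona_{i}" for i in range(1, 9)]  # placeholder names for the 8 personas
orders = condition_orders(personas, seed=0)

total_sessions = participants * len(personas) * len(STYLES)
print(total_sessions)  # 120
```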

#### Evaluation Metrics\.

After each persona, participants completed a comparative Likert survey on a three-point $\{-1, 0, +1\}$ scale, rating Reaction Naturalness (*“felt human-like and natural”*; Bartneck et al., [2009](https://arxiv.org/html/2605.06007#bib.bib8)), Persona Consistency (*“remained consistent with the role”*; Gomes et al., [2013](https://arxiv.org/html/2605.06007#bib.bib6)), and Interaction Fluidity (*“turn transitions felt smooth”*; Skantze, [2021](https://arxiv.org/html/2605.06007#bib.bib1)), followed by a forced-choice preference item and a free-text justification. Default end-to-end barge-in latency under our OpenAI/ElevenLabs configuration was approximately 1–2 s.

Table 2: Mean Likert ratings on $\{-1, 0, +1\}$ for Reaction Naturalness, Persona Consistency, and Interaction Fluidity, plus forced-choice preference (Pref. %), by Interpersonal Circumplex quadrant and interruption style ($N=5$ completers, 2 personas per quadrant, 10 ratings per cell). Best entry per quadrant in **bold**.

#### Results.

Table [2](https://arxiv.org/html/2605.06007#S4.T2) reports per-quadrant means across five completers (10 ratings per cell). Two patterns emerge. **High-agency personas (Q1)** appear to benefit from non-yielding strategies: Reaction Naturalness rises from 0.20 (Yield) to 0.60 (Probabilistic), and 60% of forced-choice votes favored Autonomous. **Low-agency, high-communion personas (Q3)** tended to favor yielding, with 70% preferring Always-Yield. Q2 preferred Probabilistic (50%), and Q4 preferred Yield (50%) yet reached its highest naturalness (0.67) under Probabilistic. Per-persona logs are released with the repository for finer-grained analysis.

Table 3: Sample qualitative feedback auto-collected by PK’s survey engine, drawn verbatim from the exported study logs.

#### Emergent Interruption Behaviors.

Table [3](https://arxiv.org/html/2605.06007#S4.T3) shows participant free-text feedback exported automatically. The raw logs further reveal persona-consistent behaviors that Always-Yield would erase. In one Drill Sergeant session under Probabilistic, the bot’s intended line was “*Louder, recruit! I can’t hear you over your weakness! Repeat it again!*”; the user cut it off at “*Repeat it*”, leaving remaining text “*again!*”. Classified as `COMPETITIVE` and sampled as `RESUME`, the bot finished with “*…again!*”, a coherent barge-in recovery that Always-Yield would have dropped entirely.

## 5 Demonstration Scenarios

At SIGDIAL, attendees experience PK from both sides. They watch the dashboard re-route turn-taking logic when a new JSON file is uploaded, then speak through the laptop’s microphone (or their own phone; see Figure [2](https://arxiv.org/html/2605.06007#S5.F2)) and try to interrupt a *Grumpy Tavern Keeper* (configured to hold the floor) versus a *Standard AI Assistant* (configured to yield). Swapping between the two characters without changing a line of code lets attendees directly feel how interruption policy reshapes perceived role realism even when the underlying LLM is unchanged.

![Refer to caption](https://arxiv.org/html/2605.06007v1/x1.png)

\(a\) Live dialogue view

![Refer to caption](https://arxiv.org/html/2605.06007v1/x2.png)

\(b\) Auto\-deployed survey

Figure 2: PersonaKit runs on both desktop and mobile. (a) The participant view shows turns from each side, persona and style labels, and live VAD status. (b) The post-session comparative survey is generated automatically from `session_config.json`.

## 6 Use Cases Beyond This Study

While our evaluation targets persona-conditioned interruption, PK is a general-purpose testbed for full-duplex dialogue research.

- **Persona prototyping:** researchers can iterate on the persona prompt, turn-taking matrix, and scenario in `persona.json` and immediately run a live user study, the primary workflow the tool is built for.
- **Custom surveys:** `session_config.json` accepts arbitrary Likert, forced-choice, and free-text banks for other constructs (e.g., trust, task success).
- **Model comparison:** because routing lives in `model_config.json`, LLM vendors, voices, or local open-weight models can be swapped while persona and policy are held fixed.
- **Data collection:** the event log pairs each interruption with its intent, sampled strategy, and follow-up utterance, a ready seed set for supervised barge-in policies or RLHF reward models.

## 7 Limitations

Our pilot ($N=5$) is descriptive, not inferential; larger samples and cross-demographic replication are needed before stronger claims about circumplex-to-strategy mappings can be made. Intent classification relies on a zero-shot LLM prompt and was not independently validated against human labels, so it can mislabel ambiguous back-channels under noisy acoustics. The four-action vocabulary (Yield, Resume, Bridge, Override) also excludes fine-grained prosodic cues such as pitch reset, latching, and gaze, a deliberate tradeoff favoring configurability over acoustic fidelity.

## 8 Conclusion

PersonaKit exposes turn-taking as a JSON-configurable persona parameter and automates the full study lifecycle from recruitment to export. Our pilot study ($N=5$) suggests that preferred turn-taking policies may vary with persona role, illustrating PK’s usefulness as a testbed for studying such effects. PK is open source and ready for community extension.

## References

- C. Bartneck, D. Kulić, E. Croft, and S. Zoghbi (2009). Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics 1(1), pp. 71–81.
- Š. Beňuš (2011). Pragmatic aspects of temporal accommodation in turn-taking. Journal of Pragmatics 43(12), pp. 3001–3027.
- P. Gomes, C. Martinho, and A. Paiva (2013). Metrics for character believability in interactive narrative. In Interactive Storytelling, pp. 92–103.
- Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen (2025). Language model can listen while speaking. In Proceedings of the AAAI Conference on Artificial Intelligence.
- H. Sacks, E. A. Schegloff, and G. Jefferson (1974). A simplest systematics for the organization of turn-taking for conversation. Language 50(4), pp. 696–735.
- G. Skantze (2021). Turn-taking in conversational systems and human-robot interaction: a review. Computer Speech & Language 67, pp. 101178.
- J. S. Wiggins (1979). A psychological taxonomy of trait-descriptive terms: the interpersonal domain. Journal of Personality and Social Psychology 37(3), pp. 395–412.
- S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018). Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of ACL.

## Equipment Requirements for Demonstration

PersonaKit runs in any modern browser on a laptop or phone, so requirements for the demo are minimal\.

#### Equipment provided by authors\.

A laptop with built\-in microphone and speakers, running PK via its cloud\-deployed instance; a phone as a secondary client to show mobile compatibility; and a pair of over\-ear headphones that attendees can use if the demo area is noisy\.

#### Furniture and equipment requested from organizers\.

A standard demo table \(one attendee at a time, plus a presenter\), two power outlets, and reliable conference Wi\-Fi\. No projector or external monitor is needed\.

#### Interaction flow\.

The demo is open to all attendees: each walks up, speaks with the agent through the provided laptop or phone, and (on request) switches between a *Grumpy Tavern Keeper* and a *Standard AI Assistant* to experience how persona-conditioned turn-taking changes the feel of the interaction. A consent notice is shown before each session, and any transcripts retained briefly for on-site walkthroughs are deleted at the end of the demo.
