
# Towards Customized Multimodal Role-Play
Source: [https://arxiv.org/html/2605.08129](https://arxiv.org/html/2605.08129)
Jianzong Wu, Qingyu Shi, Ye Tian, Aixi Zhang, Hao Jiang, Jiangning Zhang, Yunhai Tong

###### Abstract

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents. Our dataset and code will be released at: [https://github.com/Tangc03/UniCharacter](https://github.com/Tangc03/UniCharacter)


![[Uncaptioned image]](https://arxiv.org/html/2605.08129v1/x1.png)

Figure 1: Demonstration of the UniCharacter model's capabilities. The model utilizes a character's profile to maintain consistency across several integrated tasks. The core innovation is showcased in Multimodal Role-Play, where the model simultaneously generates a coherent textual response and a corresponding visual image that reflects the character's emotion. This unified generation is supplemented by the model's ability to perform Text-to-Image (T2I) Generation, Knowledge Question-Answering (Knowledge QA), and Visual Question-Answering (VQA). Together, these functions highlight UniCharacter's ability to create a cohesive, interactive, and visually embodied persona within a single framework.
## 1 Introduction

Personalized virtual characters are increasingly used in digital avatars, interactive entertainment, and human-AI communication. Existing systems usually operate in a single modality. Text-based models (Wang et al., [2023](https://arxiv.org/html/2605.08129#bib.bib10); Shao et al., [2023](https://arxiv.org/html/2605.08129#bib.bib11); Nguyen et al., [2024](https://arxiv.org/html/2605.08129#bib.bib17)) can be customized for persona-aligned role-play but cannot generate visual content. Image personalization methods (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22); Gal et al., [2022](https://arxiv.org/html/2605.08129#bib.bib23); Zeng et al., [2024](https://arxiv.org/html/2605.08129#bib.bib27)) can reproduce a character's appearance, but cannot participate in conversations or react to contextual cues. Current approaches can only customize how a character speaks or looks, but not both at the same time.

Recent unified multimodal foundation models (Chen et al., [2025b](https://arxiv.org/html/2605.08129#bib.bib37); Deng et al., [2025](https://arxiv.org/html/2605.08129#bib.bib31); Yang et al., [2025](https://arxiv.org/html/2605.08129#bib.bib49); Xie et al., [2024](https://arxiv.org/html/2605.08129#bib.bib35)) offer a promising way to bridge this gap. These models process and generate both text and images within a single architecture, and they already demonstrate strong cross-modal understanding and generation capabilities. They could support virtual characters that are both linguistically expressive and visually creative. Yet, current applications of these models (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51); Nguyen et al., [2025](https://arxiv.org/html/2605.08129#bib.bib50)) focus on tasks such as Visual Question Answering (VQA), captioning, or general text-to-image generation. None of them targets persona-driven interaction that requires consistent language style, emotional expression, and visual identity. Neglecting identity consistency across modalities prevents models from constructing a complete multimodal character. Such consistency is also a crucial foundation for more immersive user-character interaction in role-playing scenarios, and it holds significant application potential.

To address this gap, we introduce Customized Multimodal Role-Play (CMRP), a task that adapts a general-purpose multimodal model into a virtual character using minimal character-specific data (a textual profile, a few reference images, and example dialogues) to generate in-character responses and appearance-consistent images for interactive role-play with a stable persona and visual identity.

To facilitate CMRP, we introduce RoleScape-20, the first multimodal role-play dataset. It contains 20 diverse characters, each with a textual profile, 5-15 reference images, and 150-250 role-playing dialogues. We also provide fine-grained multimodal annotations, including explicit thinking processes, image-generation instructions, and paired visual or knowledge-based QA samples. These components support unified modeling of persona, language, and visual identity.

Building on this dataset, we propose UniCharacter, a framework that adapts unified multimodal models to coherent multimodal role-play via a two-stage pipeline. Stage 1 performs Unified Supervised Fine-Tuning (Unified-SFT) across all tasks. However, image-generation SFT relies on ground-truth images, which limits scaling and leads to overfitting and low output diversity. Therefore, Stage 2 introduces Character Group Relative Policy Optimization (Character-GRPO) for text-to-image (T2I) generation: its group-based sampling encourages the model to explore diverse visual representations, and because it requires no ground-truth images, it further expands the range of image-generation scenarios. By combining rewards for text-image alignment and group diversity with a penalty for similarity to training images, the Character-GRPO stage effectively enhances the diversity of model outputs in image generation.

Extensive experiments demonstrate that UniCharacter surpasses competitive baselines (e.g., UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)), DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)), Qwen2.5-VL (Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53))) in role consistency, dialogue authenticity, image fidelity, and cross-modal alignment, advancing the creation of coherent, lifelike virtual agents.

[Table 1](https://arxiv.org/html/2605.08129#S1.T1) summarizes the differences between UniCharacter and recent works. Our contributions are as follows:

- We introduce CMRP, a new task for multimodal role-play that integrates both textual role-playing and multimodal personalization, along with RoleScape-20, the first dataset designed for multimodal role-play.
- We propose UniCharacter, a two-stage framework comprising Unified-SFT and Character-GRPO, for few-shot vision-language alignment. Character-GRPO employs a reward mechanism to mitigate T2I overfitting while preserving text-image consistency.
- Extensive experiments show that our approach outperforms baselines in role consistency, dialogue quality, image fidelity, and cross-modal alignment.

Table 1: Comparison between UniCharacter and recent works.

| Method | Text Role-Play | Multimodal Role-Play | T2I Generation | Knowledge QA | VQA |
| --- | --- | --- | --- | --- | --- |
| CharacterLLM (Shao et al., [2023](https://arxiv.org/html/2605.08129#bib.bib11)) | ✓ | ✗ | ✗ | ✓ | ✗ |
| DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)) | ✗ | ✗ | ✓ | ✗ | ✗ |
| Yo'LLaVA (Nguyen et al., [2024](https://arxiv.org/html/2605.08129#bib.bib17)) | ✗ | ✗ | ✗ | ✗ | ✓ |
| MyVLM (Alaluf et al., [2024](https://arxiv.org/html/2605.08129#bib.bib19)) | ✗ | ✗ | ✗ | ✗ | ✓ |
| Yo'Chameleon (Nguyen et al., [2025](https://arxiv.org/html/2605.08129#bib.bib50)) | ✗ | ✗ | ✓ | ✗ | ✓ |
| UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)) | ✗ | ✗ | ✓ | ✗ | ✓ |
| UniCharacter (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |

## 2 Related Work

Table 2: Comparison between RoleScape-20 and related datasets. Multimodal Role-Play Data denotes paired visual-textual role-playing episodes where images depict the character in a context that aligns with their dialogue, emotional state, and personality. Char refers to character; Img refers to image.

| Dataset | Modality | #Chars | #Img/Char | #Dialogues/Char | #VQA/Char | #QA/Char | Role-Play Dialogue | Knowledge QA | VQA | Character Images | Multimodal Role-Play Data | Thinking Process |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CharacterLLM (Shao et al., [2023](https://arxiv.org/html/2605.08129#bib.bib11)) | Text | 9 | – | 1.6K | – | – | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ChatHaruhi-54K (Li et al., [2023](https://arxiv.org/html/2605.08129#bib.bib9)) | Text | 32 | – | 1.7K | – | – | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)) | Image | 30 | 3-5 | – | – | – | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
| Yo'LLaVA (Nguyen et al., [2024](https://arxiv.org/html/2605.08129#bib.bib17)) | Image+Text | 40 | 5-10 | – | ~4 | – | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ |
| MyVLM (Alaluf et al., [2024](https://arxiv.org/html/2605.08129#bib.bib19)) | Image+Text | 30 | 10 | – | 90 | – | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ |
| UnifyBench (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)) | Image+Text | 20 | 5-10 | – | ~200 | – | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ |
| RoleScape-20 (Ours) | Image+Text | 20 | 5-15 | 150-250 | ~200 | ~100 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Customized Generation. Customized generation produces content that follows the role specified by the user, represented through text (Xu et al., [2024](https://arxiv.org/html/2605.08129#bib.bib13); Wang et al., [2025](https://arxiv.org/html/2605.08129#bib.bib12); Chen et al., [2025a](https://arxiv.org/html/2605.08129#bib.bib15)) or images (Gal et al., [2022](https://arxiv.org/html/2605.08129#bib.bib23); Guo et al., [2024](https://arxiv.org/html/2605.08129#bib.bib24); Kumari et al., [2023](https://arxiv.org/html/2605.08129#bib.bib26); Wu et al., [2025b](https://arxiv.org/html/2605.08129#bib.bib30), [2024](https://arxiv.org/html/2605.08129#bib.bib29), [2023](https://arxiv.org/html/2605.08129#bib.bib28)). Previous customization methods can be broadly divided into two categories: training-based (Ye et al., [2023](https://arxiv.org/html/2605.08129#bib.bib25); Zeng et al., [2024](https://arxiv.org/html/2605.08129#bib.bib27)) and tuning-based (Wang et al., [2023](https://arxiv.org/html/2605.08129#bib.bib10); Shao et al., [2023](https://arxiv.org/html/2605.08129#bib.bib11); Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22); Shi et al., [2025b](https://arxiv.org/html/2605.08129#bib.bib42)) approaches. Training-based methods introduce an extra module to encode the user input and guide the generation process. Tuning-based methods, on the other hand, finetune part of the model parameters to learn the user-provided role. They use a special token during finetuning and insert it at inference time for customization, achieving strong role fidelity and controllability.

However, existing methods (Li et al., [2023](https://arxiv.org/html/2605.08129#bib.bib9); Nguyen et al., [2024](https://arxiv.org/html/2605.08129#bib.bib17); An et al., [2024](https://arxiv.org/html/2605.08129#bib.bib18)) are limited to a single modality. During inference, the model customizes through text or images only (Alaluf et al., [2024](https://arxiv.org/html/2605.08129#bib.bib19); Oh et al., [2025](https://arxiv.org/html/2605.08129#bib.bib20); Hao et al., [2025](https://arxiv.org/html/2605.08129#bib.bib21); Shi et al., [2025c](https://arxiv.org/html/2605.08129#bib.bib43)), making it difficult to support interactions requiring both outputs. To address this, we propose Customized Multimodal Role-Play, a new task requiring joint text and image generation from user inputs, and introduce UniCharacter, a tuning-based method for this setting.

Customized Unified Multimodal Models. Unified models now integrate multimodal understanding and generation (Deng et al., [2025](https://arxiv.org/html/2605.08129#bib.bib31); Wu et al., [2025a](https://arxiv.org/html/2605.08129#bib.bib34); Xie et al., [2025](https://arxiv.org/html/2605.08129#bib.bib36); Chen et al., [2025b](https://arxiv.org/html/2605.08129#bib.bib37); Cui et al., [2025](https://arxiv.org/html/2605.08129#bib.bib45); Shi et al., [2025a](https://arxiv.org/html/2605.08129#bib.bib48); Yang et al., [2025](https://arxiv.org/html/2605.08129#bib.bib49)), yet coherent personalization remains challenging. While Yo'Chameleon (Nguyen et al., [2025](https://arxiv.org/html/2605.08129#bib.bib50)) uses disjoint strategies and UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)) tackles unified personalization, neither supports complex interactive scenarios. We address this with a framework that jointly models key dimensions, including persona, dialogue style, visual identity, and emotion. Our approach extends unified models to complex personalized interactions and enables agents with consistent personalities and cross-modal coherence. Meanwhile, Group Relative Policy Optimization (GRPO) has gained traction as an RL-based tuning method since DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2605.08129#bib.bib59)), with work adapting it to unified models (Jiang et al., [2025](https://arxiv.org/html/2605.08129#bib.bib62); Mao et al., [2025](https://arxiv.org/html/2605.08129#bib.bib63)) or flow-matching image generators (Liu et al., [2025](https://arxiv.org/html/2605.08129#bib.bib64); Zheng et al., [2025](https://arxiv.org/html/2605.08129#bib.bib65)). These works target general tasks and datasets; we bridge the two directions by incorporating GRPO into the rectified flow-based image-generation branch of unified multimodal models, using a reward function tailored for CMRP to increase generation diversity while maintaining image quality and text-image alignment.

## 3 The RoleScape-20 Dataset

![Refer to caption](https://arxiv.org/html/2605.08129v1/x2.png)

Figure 2: Data Construction Pipeline of the RoleScape-20 Dataset. The pipeline processes raw character materials (dialogues, images, profiles) into diverse training data, including multimodal role-play dialogues, T2I generation pairs, knowledge QA, and VQA pairs.

### 3.1 Problem Formulation

We define the core task as Customized Multimodal Role-Play (CMRP), which aims to develop a computational agent that can faithfully emulate a specific virtual character based on a comprehensive character definition. A specific character is defined by a triplet $\mathcal{C}=\{\mathcal{P}_{\text{char}},\mathcal{I}_{\text{core}},\mathcal{D}_{\text{ref}}\}$. $\mathcal{P}_{\text{char}}$ is the textual profile, describing the character's personality, background, and traits. $\mathcal{I}_{\text{core}}$ is a set of core reference images that define the character's visual identity. $\mathcal{D}_{\text{ref}}$ is a collection of reference dialogues capturing the character's unique speaking style and linguistic habits.

Given the character definition $\mathcal{C}$ and a user's textual query $Q_u$, the CMRP task requires the model $\mathcal{F}_\theta$ to generate a multimodal response pair $(R_m, I_m)$, which must satisfy two key constraints: $R_m$ must follow the personality and speaking style defined in $\mathcal{P}_{\text{char}}$ and $\mathcal{D}_{\text{ref}}$, and $I_m$ must accurately depict the character's visual features as specified in $\mathcal{I}_{\text{core}}$ while being contextually relevant to $R_m$ and $Q_u$. Formally, this interaction is represented as $(R_m, I_m)=\mathcal{F}_\theta(Q_u)$.

Conceptually, this output is a sample from a conditional joint probability distribution shaped by the character dataset: $(R_m, I_m)\sim P(R, I \mid Q_u, \mathcal{C}; \theta)$. This joint probability can be decomposed via the chain rule into two sequential stages: text generation $R_m \sim P(R \mid Q_u, \mathcal{C}; \theta)$ followed by conditional image generation $I_m \sim P(I \mid R_m, Q_u, \mathcal{C}; \theta)$.

The model integrates four core CMRP capabilities.

Multimodal Role-Play. This is the primary task, where the model acts as the character. Given a user query $Q_u$, the model generates a text-image response pair: $(R_m, I_m)\sim P(R, I \mid Q_u, \mathcal{C}; \theta)$. The response must maintain high persona consistency in linguistic style and visual identity.

Text-to-Image (T2I) Generation. This capability focuses on the model's ability to translate a textual instruction or scene description $C_j$ into a high-quality, consistent image $I_j$ that adheres to the character's visual identity in $\mathcal{I}_{\text{core}}$, modeled as $I_j \sim P(I \mid C_j, \mathcal{C}; \theta)$.

Visual Question Answering (VQA). VQA evaluates the model's understanding of the character's visual attributes. Given a reference image $I_{\text{ref}}\in\mathcal{I}_{\text{core}}$ and a specific question $Q_v$ regarding its details, the model must provide an accurate textual answer $A_v$: $A_v \sim P(A \mid I_{\text{ref}}, Q_v, \mathcal{C}; \theta)$.

Knowledge Question Answering (Knowledge QA). This task requires the model to recall and reason over the character's background information. Given a textual question $Q_k$ about the character's life, traits, or history, the model retrieves the answer $A_k$ from $\mathcal{P}_{\text{char}}$: $A_k \sim P(A \mid Q_k, \mathcal{C}; \theta)$.
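The chain-rule decomposition above maps directly onto a two-stage inference loop: sample the text response first, then condition the image on it. The following minimal Python sketch illustrates this; `Character`, `UnifiedModel`, and the method names are hypothetical stand-ins, not the paper's actual interfaces.

```python
# A minimal sketch of the CMRP chain-rule decomposition:
# text generation followed by image generation conditioned on the text.
from dataclasses import dataclass, field

@dataclass
class Character:
    profile: str                                        # P_char: textual persona
    core_images: list = field(default_factory=list)     # I_core: reference images
    ref_dialogues: list = field(default_factory=list)   # D_ref: style exemplars

class UnifiedModel:
    def generate_text(self, query: str, character: Character) -> str:
        # R_m ~ P(R | Q_u, C; theta): autoregressive, in-character response
        return f"[in-character reply to: {query}]"      # stub

    def generate_image(self, response: str, query: str, character: Character):
        # I_m ~ P(I | R_m, Q_u, C; theta): rectified-flow image sampling
        return f"[image conditioned on: {response}]"    # stub

def cmrp_respond(model: UnifiedModel, character: Character, query: str):
    """Sample (R_m, I_m) ~ P(R, I | Q_u, C; theta) in two sequential stages."""
    r_m = model.generate_text(query, character)
    i_m = model.generate_image(r_m, query, character)
    return r_m, i_m
```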

### 3.2 Dataset Construction

Overview. Existing general-purpose image-text dialogue datasets are insufficient for the demands of deep character customization. To address this, we construct RoleScape-20, a new dataset specifically designed for the CMRP task. It comprises 20 diverse characters organized into three main categories: nine real-world figures, mostly from movies and TV series; seven anime and game characters; and four animals. The raw materials for our dataset are sourced from various channels to ensure richness and authenticity. Images are collected from real photographs and high-resolution screenshots from films, television shows, games, and anime. Dialogues are compiled from authentic conversations found online and further supplemented and stylized using Large Language Models (LLMs) to align with character personas. Character profiles are sourced from authoritative references such as Wikipedia for real figures, or generated by LLMs based on established lore for fictional characters. All materials undergo manual inspection and screening.

![Refer to caption](https://arxiv.org/html/2605.08129v1/x3.png)

Figure 3: Overview of the UniCharacter framework. Stage 1 focuses on Unified-SFT, using MSE loss for image outputs and CE loss for text outputs. Stage 2 implements Character-GRPO, optimizing the policy $\pi_\theta$ via a multi-reward mechanism that considers both text-image alignment and generation diversity.

Comparison with Related Datasets. RoleScape-20 fills a critical gap in the existing landscape of role-playing and customization datasets. Compared to text-only role-playing datasets such as Character-LLM (Shao et al., [2023](https://arxiv.org/html/2605.08129#bib.bib11)) and ChatHaruhi-54K (Li et al., [2023](https://arxiv.org/html/2605.08129#bib.bib9)), our dataset introduces the essential visual modality required for training and evaluating multimodal consistency. In contrast to personalized multimodal datasets such as Yo'LLaVA (Nguyen et al., [2024](https://arxiv.org/html/2605.08129#bib.bib17)), MyVLM (Alaluf et al., [2024](https://arxiv.org/html/2605.08129#bib.bib19)), and UnifyBench (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)), which often lack deep, in-character conversations, RoleScape-20 provides rich, personality-driven dialogues instead of simple image descriptions. Furthermore, unlike image customization datasets such as DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)), which focus narrowly on visual generation from simple, standardized captions, our dataset provides complex, multifaceted textual annotations, including conversational context, reasoning processes, and character knowledge. RoleScape-20 is the first dataset to provide this comprehensive suite of annotations, including fine-grained generation instructions, thinking processes, and both knowledge-based and visual question-answering pairs, establishing a solid foundation for training truly deep and consistent multimodal role-playing models. A more detailed comparison with previous datasets is presented in [Table 2](https://arxiv.org/html/2605.08129#S2.T2).

Construction Pipeline. We designed a systematic annotation pipeline, shown in [Figure 2](https://arxiv.org/html/2605.08129#S3.F2), to process the raw materials into a multi-faceted dataset capable of comprehensively training all the required model capabilities. The process consists of four main stages, sketched in code after this paragraph. First, to extend the amount of dialogue, we use the Qwen3 LLM (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52)) to expand upon the initial reference dialogues ($\mathcal{D}_{\text{ref}}$) and character profile ($\mathcal{P}_{\text{char}}$), generating 150-200 new dialogue samples that faithfully mimic the character and produce the final text-only dialogue set, $\mathcal{D}_{\text{dialogue}}$. Second, for multimodal role-play and T2I generation data annotation, we take a core image ($I_j$) and its corresponding dialogue pair $(Q_u, R_m)$ as input. Using GPT-4o (OpenAI, [2024](https://arxiv.org/html/2605.08129#bib.bib56)), we generate two crucial annotations: a "Thinking Process" that guides the image-generation process, and a clear "Instruction" to guide image generation for that specific context, resulting in a richly annotated data tuple $(I_j, Q_u, R_m, \text{Thinking Process}, \text{Instruction})$. Third, to construct the knowledge QA dataset ($\mathcal{D}_{\text{kqa}}$), we employ an LLM to extract key information from the character profile ($\mathcal{P}_{\text{char}}$) and automatically convert it into question-answer pairs. Finally, for the visual QA dataset ($\mathcal{D}_{\text{vqa}}$), we use the Qwen3-VL multimodal model (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52); Bai et al., [2023](https://arxiv.org/html/2605.08129#bib.bib55); Wang et al., [2024](https://arxiv.org/html/2605.08129#bib.bib54); Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53)), providing it with a core image ($I_j$) and the character profile for context, to generate approximately 20 question-answer pairs focused on specific visual details within the image. All annotations are manually verified to confirm their authenticity. Data construction details are in [Appendix F](https://arxiv.org/html/2605.08129#A6).
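The four stages can be summarized schematically as below. The helper callables (`call_llm`, `call_gpt4o`, `call_qwen_vl`) and the prompt strings are illustrative placeholders for the Qwen3, GPT-4o, and Qwen3-VL calls described above, not the paper's actual annotation prompts.

```python
# A schematic sketch of the four-stage RoleScape-20 annotation pipeline.
# Model-call helpers are injected as parameters; prompts are illustrative.
def build_character_data(profile, ref_dialogues, core_images, dialogue_pairs,
                         call_llm, call_gpt4o, call_qwen_vl):
    # Stage 1: expand reference dialogues into 150-200 in-character samples.
    dialogues = call_llm(
        f"Given profile:\n{profile}\nand examples:\n{ref_dialogues}\n"
        "generate new in-character dialogue samples.")
    # Stage 2: annotate each (image, dialogue) pair with a thinking process
    # and an image-generation instruction, forming the data tuple
    # (I_j, Q_u, R_m, Thinking Process, Instruction).
    multimodal = [
        dict(image=img, query=q, response=r,
             annotation=call_gpt4o(img, q, r))
        for img, (q, r) in zip(core_images, dialogue_pairs)
    ]
    # Stage 3: convert profile facts into knowledge QA pairs.
    kqa = call_llm(f"Turn this profile into question-answer pairs:\n{profile}")
    # Stage 4: ~20 visual QA pairs per core image, grounded in the profile.
    vqa = [call_qwen_vl(img, profile) for img in core_images]
    return dialogues, multimodal, kqa, vqa
```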

## 4 Method

An overview of UniCharacter is shown in [Figure 3](https://arxiv.org/html/2605.08129#S3.F3). Detailed preliminaries are in [Appendix B](https://arxiv.org/html/2605.08129#A2).

Table 3: Comparison with previous works on the RoleScape-20 benchmark. TP = Text Prompt; RP = Text-based Role Play; MRP = Multimodal Role-Play (T2T2I). Higher values are better.

| Method | Size | RP Memorization | RP Personality | RP Diversity | T2I CLIP-I | T2I CLIP-T | T2I DINO | MRP CLIP-I | MRP CLIP-T | MRP DINO | Knowledge QA | VQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)) | 7B | – | – | – | 0.86 | 0.30 | 0.88 | – | – | – | – | – |
| Qwen2.5-VL+TP (Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53)) | 7B | 5.13 | 5.17 | 5.60 | – | – | – | – | – | – | 0.75 | 0.81 |
| UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)) | 1.3B | 2.43 | 2.54 | 2.30 | 0.73 | 0.34 | 0.83 | 0.51 | 0.17 | 0.70 | 0.08 | 0.21 |
| UniCharacter (Ours) | 7B | 5.45 | 6.55 | 6.10 | 0.88 | 0.33 | 0.91 | 0.86 | 0.33 | 0.89 | 0.77 | 0.84 |

### 4.1 Unified Supervised Finetuning

We frame the Unified-SFT process as a multi-task learning problem, divided into two task categories.

Finetuning for Text Generation. This category of tasks, which we refer to as Vision-Language Understanding tasks, is designed to enhance the model's ability to comprehend and express the character's non-visual attributes. It includes four sub-tasks: Role-Play Chatting, where the model learns to generate in-character text responses; the Thinking Task, where it learns to generate a reasoning process for character image generation; Visual Question Answering (VQA), where it answers questions based on a character image; and Knowledge Question Answering (Knowledge QA), where it answers questions about the character. The optimization objective for each of these tasks is to maximize the conditional probability of the target text, and the loss for each ($\mathcal{L}_{\text{chat}}, \mathcal{L}_{\text{think}}, \mathcal{L}_{\text{vqa}}, \mathcal{L}_{\text{kqa}}$) is calculated using a standard Cross-Entropy (CE) loss. The total loss for this category is the sum of the individual task losses:

$$\mathcal{L}_{\text{VLM}}=\mathcal{L}_{\text{chat}}+\mathcal{L}_{\text{think}}+\mathcal{L}_{\text{vqa}}+\mathcal{L}_{\text{kqa}}\quad(1)$$
Finetuning for Image Generation. The T2I generation task generates an image $I_{\text{gen}}$ based on text $C$. We use a Rectified Flow-based approach, where the loss function is the mean squared error (MSE) on the noise-to-clean residual.
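A minimal PyTorch sketch of both Stage-1 objectives follows: Eq. (1)'s summed cross-entropy over the four text tasks, and a rectified-flow MSE for the image branch. The tensor shapes, the velocity-prediction interface, and the interpolation convention are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def vlm_loss(logits_by_task, targets_by_task):
    """L_VLM = L_chat + L_think + L_vqa + L_kqa, Eq. (1) (unit weights)."""
    total = 0.0
    for task in ("chat", "think", "vqa", "kqa"):
        logits = logits_by_task[task]        # (B, T, vocab)
        targets = targets_by_task[task]      # (B, T), -100 marks padding
        total = total + F.cross_entropy(
            logits.flatten(0, 1), targets.flatten(), ignore_index=-100)
    return total

def rectified_flow_loss(model, x0, cond):
    """MSE on the noise-to-clean residual under one common RF convention:
    x_t = (1 - t) * eps + t * x0, target velocity = x0 - eps."""
    eps = torch.randn_like(x0)
    t = torch.rand(x0.size(0), device=x0.device)     # uniform timestep
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))         # broadcastable shape
    x_t = (1 - t_) * eps + t_ * x0                   # linear interpolation
    v_pred = model(x_t, t, cond)                     # predicted velocity
    return F.mse_loss(v_pred, x0 - eps)
```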

### 4.2 Character-GRPO

Although the Unified-SFT stage enables the model to perform well in textual dialogue, the Text-to-Image (T2I) branch often suffers from visual overfitting, resulting in generated images that lack variety. This is primarily because SFT relies on a limited set of fixed ground-truth images. To address this, we introduce Character-GRPO, a reinforcement learning stage dedicated to the T2I branch. In this stage, the model is no longer restricted to a single ground-truth image; instead, it generates multiple samples $\{I_1, I_2, \dots, I_G\}$ for each character-specific prompt. This multi-sample generation allows the model to explore a broader generation space, effectively mitigating overfitting. Furthermore, since GRPO does not require ground-truth images, it serves as a self-evolving data-expansion mechanism that increases the diversity and volume of character-image mappings beyond the original training set. To guide the policy $\pi_\theta$ toward generating character-consistent and diverse multimodal content, we define a comprehensive reward function $R_{total}$ comprising alignment and diversity components.

Text-Image Alignment Rewards. These rewards, shown in [Figure 3](https://arxiv.org/html/2605.08129#S3.F3), ensure that the generated visual content $I_{gen}$ adheres to the textual prompt $T_{prompt}$ and the character's intrinsic attributes. The CLIP Similarity Reward ($r_{CLIP}$) measures the semantic alignment between the image and the prompt using CLIP:

$$r_{CLIP}=\cos\big(\phi_I(I_{gen}),\,\phi_T(T_{prompt})\big)\quad(2)$$

where $\phi_I$ and $\phi_T$ denote the CLIP image and text encoders, respectively. The VQA Consistency Reward ($r_{VQA}$) verifies fine-grained character traits based on the correctness of the model's answers to specific annotated questions:

$$r_{VQA}=\begin{cases}1,&\text{if }\mathrm{VQA}(I_{gen},T_{VQA})\text{ matches }T_{truth}\\0,&\text{otherwise}\end{cases}\quad(3)$$
Diversity Rewards. To avoid overfitting and encourage diverse generation, we penalize redundancy and similarity to the training set, as shown in [Figure 3](https://arxiv.org/html/2605.08129#S3.F3). The Perceptual Diversity Reward ($r_{div}$) uses the Learned Perceptual Image Patch Similarity (LPIPS) to measure the visual variance within the sampled group:

$$r_{div}=\frac{1}{G(G-1)}\sum_{i=1}^{G}\sum_{j\neq i}^{G}\mathrm{LPIPS}(I_i,I_j)\quad(4)$$

The Trainset Similarity Penalty ($p_{sim}$) prevents the model from memorizing training samples while ensuring it retains the target character's essential features. $p_{sim}$ is calculated with an upper threshold $\tau_{high}$ and a lower threshold $\tau_{low}$:

$$p_{sim}=\begin{cases}-(s_{max}-\tau_{high}),&\text{if }s_{max}>\tau_{high}\\-(\tau_{low}-s_{max}),&\text{if }s_{max}<\tau_{low}\\0,&\text{otherwise}\end{cases}\quad(5)$$

where $s_{max}$ is the maximum cosine similarity between $I_{gen}$ and the training set $\mathcal{D}_{train}$ in the DINO feature space:

$$s_{max}=\max_{I_k\in\mathcal{D}_{train}}\mathrm{Sim}_{\text{DINO}}(I_{gen},I_k)$$
Comprehensive Reward. The final reward $R_i$ for a sample in the group is a combination of the above components, providing a signal for GRPO's advantage computation:

$$R_i=\underbrace{\alpha\cdot r_{CLIP}+\beta\cdot r_{VQA}}_{\text{Text-Image Alignment}}+\underbrace{\gamma\cdot r_{div}+\delta\cdot p_{sim}}_{\text{Diversity}}\quad(6)$$

where $\alpha,\beta,\gamma,\delta$ are hyperparameters, with default values $\alpha=0.45$, $\beta=0.3$, $\gamma=0.1$, $\delta=0.15$.
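Putting Eqs. (2)-(6) together, a sketch of the per-group reward computation might look as follows. The similarity and LPIPS callables and the VQA correctness flags are assumed to come from pretrained CLIP, DINO, and LPIPS models plus a VQA judge; the threshold values `tau_low` and `tau_high` are illustrative, as the text does not state them.

```python
import itertools

def total_reward(images, prompt, vqa_correct, clip_sim, dino_sim, lpips_dist,
                 train_images, alpha=0.45, beta=0.3, gamma=0.1, delta=0.15,
                 tau_low=0.4, tau_high=0.8):
    """Compute R_i (Eq. 6) for each image in a GRPO group of size G."""
    G = len(images)
    # r_div (Eq. 4): mean pairwise LPIPS distance within the group
    # (a single group-level value, shared by every sample).
    r_div = sum(lpips_dist(a, b)
                for a, b in itertools.permutations(images, 2)) / (G * (G - 1))
    rewards = []
    for i, img in enumerate(images):
        r_clip = clip_sim(img, prompt)                 # Eq. (2)
        r_vqa = 1.0 if vqa_correct[i] else 0.0         # Eq. (3)
        # p_sim (Eq. 5): max DINO similarity to the training set, penalized
        # both when too close (memorization) and too far (identity lost).
        s_max = max(dino_sim(img, ref) for ref in train_images)
        if s_max > tau_high:
            p_sim = -(s_max - tau_high)
        elif s_max < tau_low:
            p_sim = -(tau_low - s_max)
        else:
            p_sim = 0.0
        rewards.append(alpha * r_clip + beta * r_vqa   # Eq. (6)
                       + gamma * r_div + delta * p_sim)
    return rewards
```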

## 5 Experiment

### 5.1 Experiment Setup

Implementation Details. We select BAGEL (Deng et al., [2025](https://arxiv.org/html/2605.08129#bib.bib31)) as the base model. The Unified-SFT stage freezes the VAE and sets the training step count to 500. The Character-GRPO stage freezes the understanding branch, including the ViT. All experiments were conducted on NVIDIA H20 GPUs, requiring about 100 GPU hours per character. More training details are in [Appendix G](https://arxiv.org/html/2605.08129#A7).
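A minimal sketch of the stage-specific freezing described above; the module attribute names (`vae`, `vit`, `understanding_head`) are assumptions and may differ from BAGEL's actual code.

```python
def freeze(module):
    # Detach a submodule from gradient updates.
    for p in module.parameters():
        p.requires_grad_(False)

def setup_stage(model, stage: str):
    if stage == "unified_sft":
        freeze(model.vae)                 # keep the image autoencoder fixed
    elif stage == "character_grpo":
        freeze(model.vit)                 # freeze the understanding branch
        freeze(model.understanding_head)  # hypothetical module name
```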

![Refer to caption](https://arxiv.org/html/2605.08129v1/x4.png)

Figure 4: Qualitative comparison between UniCharacter and DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)), UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)), and Qwen2.5-VL (Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53)) on various tasks and cases. In the Role-Play cases, UniCharacter is superior because it effectively conveys Chandler's personality through concise, sarcastic humor, while Qwen2.5-VL breaks character by being long-winded and over-explaining his feelings.

Baselines. Given the limited body of existing work on personalized unified models, we chose UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)) as the only baseline in the most closely related domain. To evaluate the model's personalized generation capabilities, we established a baseline equivalent to the DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)) method by freezing the `visual_und` component of BAGEL and fine-tuning it solely on T2I data. For assessing personalized understanding and role-playing abilities, we selected Qwen2.5-VL-7B (Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53)) with Text Prompt (TP) as a baseline, providing the model with profiles and sample dialogues as text prompts.

Metrics. We evaluate the model's performance on five tasks. For the Text-based Role Play task, we employ an "LLM-as-Judge" methodology: we use the Qwen3 model (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52)) to assign scores for "Memorization", "Personality", and "Diversity" to each of the model's responses. Details of our "LLM-as-Judge" methodology are in [Section G.2](https://arxiv.org/html/2605.08129#A7.SS2) of the Appendix. For the T2I and Multimodal Role-Play tasks, we assess performance using CLIP-I, CLIP-T, and DINO metrics. For the Knowledge QA and VQA tasks, we create approximately 10 multiple-choice knowledge-based questions for each character, and about five multiple-choice VQA questions for each image associated with every character. Accuracy is used as the evaluation metric.
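For the image metrics, a small sketch assuming precomputed, unit-normalized embeddings: CLIP-I (and DINO, using DINO features in the same form) is the mean image-image cosine similarity against the reference set, and CLIP-T is the image-prompt cosine similarity. Embedding extraction with CLIP/DINO is assumed to happen elsewhere.

```python
import torch

def clip_i(gen_emb: torch.Tensor, ref_embs: torch.Tensor) -> float:
    """Mean cosine similarity between one generated-image embedding (D,)
    and N reference-image embeddings (N, D); all pre-normalized."""
    return (gen_emb @ ref_embs.T).mean().item()

def clip_t(img_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Cosine similarity between an image embedding and a prompt embedding."""
    return float(img_emb @ text_emb)

# DINO score: identical to clip_i, computed on DINO features instead.
```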

### 5.2 Quantitative Results

As shown in [Table 3](https://arxiv.org/html/2605.08129#S4.T3), our method demonstrates strong overall performance across all tasks. On image-generation tasks, our model outperforms leading T2I baselines such as DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)). In text-based role-playing, our approach surpasses a strong vision-language model (Qwen2.5-VL (Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53))+TP) on all metrics, indicating superior role embodiment and generative capability. On knowledge-based QA and VQA tasks, our method also surpasses a state-of-the-art VLM, confirming that it retains strong comprehension abilities without compromising generative quality. Furthermore, our method significantly outperforms UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)) across most evaluated tasks, highlighting its stronger unified modeling capacity.

![Refer to caption](https://arxiv.org/html/2605.08129v1/x5.png)

Figure 5: Qualitative results on two characters. The two characters correspond to Chandler and Joey in the dataset. We qualitatively demonstrate the model's performance across multimodal role-play, text-to-image generation, Knowledge QA, and VQA.
### 5.3 Qualitative Results

As shown in [Figure 4](https://arxiv.org/html/2605.08129#S5.F4), we conduct a qualitative comparison between UniCharacter and the baseline methods. On the T2I generation task, UniCharacter achieves superior results to DreamBooth in both image quality and text-image alignment. On the multimodal role-play task, we compare against UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)): UniCharacter not only has a clear advantage in identity fidelity, but also generates text responses that better match the character, giving users a more character-consistent role-play experience. We show further qualitative results in [Figure 5](https://arxiv.org/html/2605.08129#S5.F5), where UniCharacter stays consistent with each character's identity, and the text responses align with each character's personality traits. In addition, UniCharacter reduces the overfitting problem of conventional customization methods: the character's state can be effectively controlled by user input while remaining consistent with the model's responses. Overall, our model shows strong multimodal role-playing ability. More qualitative results are displayed in [Appendix C](https://arxiv.org/html/2605.08129#A3).

Table 4: Average performance across all characters under different training-stage settings. MRP = Multimodal Role-Play; Trainset Similarity columns are lower-is-better (↓).

| Setting | T2I CLIP-I | T2I CLIP-T | T2I DINO | MRP CLIP-I | MRP CLIP-T | MRP DINO | T2I Trainset Sim. CLIP-I ↓ | T2I Trainset Sim. DINO ↓ | MRP Trainset Sim. CLIP-I ↓ | MRP Trainset Sim. DINO ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o GRPO | 0.85 | 0.30 | 0.88 | 0.82 | 0.31 | 0.85 | 0.89 | 0.92 | 0.90 | 0.92 |
| w/ GRPO | 0.88 | 0.33 | 0.91 | 0.86 | 0.33 | 0.89 | 0.86 | 0.90 | 0.87 | 0.89 |

Table 5: Average performance across several characters under different GRPO reward settings. MRP = Multimodal Role-Play; Trainset Similarity columns are lower-is-better (↓).

| Setting | T2I CLIP-I | T2I CLIP-T | T2I DINO | MRP CLIP-I | MRP CLIP-T | MRP DINO | T2I Trainset Sim. CLIP-I ↓ | T2I Trainset Sim. DINO ↓ | MRP Trainset Sim. CLIP-I ↓ | MRP Trainset Sim. DINO ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o CLIP-T Reward | 0.87 | 0.29 | 0.92 | 0.86 | 0.32 | 0.89 | 0.91 | 0.94 | 0.90 | 0.95 |
| w/o VQA Reward | 0.77 | 0.28 | 0.83 | 0.80 | 0.31 | 0.84 | 0.81 | 0.84 | 0.84 | 0.89 |
| w/o Diversity Reward | 0.87 | 0.30 | 0.92 | 0.86 | 0.32 | 0.88 | 0.90 | 0.93 | 0.91 | 0.94 |
| w/o Similarity Penalty | 0.87 | 0.30 | 0.91 | 0.86 | 0.32 | 0.89 | 0.90 | 0.93 | 0.91 | 0.94 |
| UniCharacter | 0.88 | 0.31 | 0.93 | 0.88 | 0.32 | 0.92 | 0.89 | 0.92 | 0.88 | 0.92 |

### 5.4 Ablation Study

We compared the impact of including versus excluding the GRPO training stage, with quantitative results presented in [Table 4](https://arxiv.org/html/2605.08129#S5.T4). For image-related tasks, including T2I generation and Multimodal Role-Play, models trained with the GRPO stage perform better on both image-quality and trainset-similarity metrics than those trained without it. Qualitative results illustrating the impact of the GRPO training stage are presented in [Figure 6](https://arxiv.org/html/2605.08129#A2.F6) within [Appendix D](https://arxiv.org/html/2605.08129#A4).

We evaluated the effects of the individual rewards and penalties during the GRPO training stage in [Table 5](https://arxiv.org/html/2605.08129#S5.T5). The results show that the text-image alignment rewards enhance image quality, while the diversity components decrease similarity to the training set. More ablation studies are in [Appendix D](https://arxiv.org/html/2605.08129#A4).

## 6 Conclusion

We introduce Customized Multimodal Role-Play (CMRP), a new task for building multimodal virtual characters, and propose UniCharacter, a unified framework that turns a general-purpose multimodal foundation model into a coherent, personalized character using only a handful of images and dialogue examples. Built on the new RoleScape-20 benchmark, UniCharacter jointly models role-play chatting, thinking processes, knowledge QA, VQA, and T2I generation, aligning persona, dialogue style, and visual identity. Our two-stage training framework mitigates overfitting in the few-shot regime, improving both image diversity and generalization. Ablation studies verify the importance of our reward design and the Character-GRPO stage for strengthening both role-play quality and multimodal alignment.

Limitations and Future Directions. While UniCharacter demonstrates strong performance on the CMRP task, several avenues remain for exploration. First, the current task is built on text and images; extending it to customized video generation, with temporal consistency and stable character identity across frames, remains a significant challenge. Moreover, although UniCharacter handles standard interactions well, the CMRP task is limited to single-turn scenarios, leaving its stability in multi-turn or long-horizon dialogues untested and suggesting a need for more robust long-term memory mechanisms to prevent role drift. Future work will also focus on real-time deployment, safety and controllability, and user-in-the-loop customization to foster more immersive and trustworthy character agents.

## Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2023YFC3807600).

## Impact Statement

This paper introduces UniCharacter, a framework for Customized Multimodal Role-Play that leverages the capabilities of unified multimodal models to create interactive agents that are both character-rich and deeply immersive. While our work carries broader societal implications, we believe there are no specific concerns that necessitate detailed discussion in this context.

## References

- Y. Alaluf, E. Richardson, S. Tulyakov, K. Aberman, and D. Cohen-Or (2024). MyVLM: Personalizing VLMs for user-specific queries. arXiv preprint arXiv:2403.14599.
- R. An, S. Yang, M. Lu, R. Zhang, K. Zeng, Y. Luo, J. Cao, H. Liang, Y. Chen, Q. She, et al. (2024). MC-LLaVA: Multi-concept personalized vision-language model. arXiv preprint arXiv:2411.11706.
- R. An, S. Yang, R. Zhang, Z. Shen, M. Lu, G. Dai, H. Liang, Z. Guo, S. Yan, Y. Luo, et al. (2025). UniCTokens: Boosting personalized understanding and generation via unified concept tokens. arXiv preprint arXiv:2505.14671.
- J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou (2023). Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
- S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin (2025). Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- R. Chen, A. Arditi, H. Sleight, O. Evans, and J. Lindsey (2025a). Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509.
- X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan (2025b). Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811.
- Y. Cui, H. Chen, H. Deng, X. Huang, X. Li, J. Liu, Y. Liu, Z. Luo, J. Wang, W. Wang, Y. Wang, C. Wang, F. Zhang, Y. Zhao, T. Pan, X. Li, Z. Hao, W. Ma, Z. Chen, Y. Ao, T. Huang, Z. Wang, and X. Wang (2025). Emu3.5: Native multimodal models are world learners. arXiv preprint arXiv:2510.26583.
- C. Deng, D. Zhu, K. Li, C. Gou, F. Li, Z. Wang, S. Zhong, W. Yu, X. Nie, Z. Song, G. Shi, and H. Fan (2025). Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683.
- R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or (2022). An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
- Gemini Team (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature.
- Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai (2024). AnimateDiff: Animate your personalized text-to-image diffusion models without specific tuning. ICLR.
- H. Hao, J. Han, C. Li, Y. Li, and X. Yue (2025). RAP: Retrieval-augmented personalization for multimodal large language models. CVPR.
- J. Jiang, C. Si, J. Luo, H. Zhang, and C. Ma (2025). Co-reinforcement learning for unified multimodal understanding and generation. arXiv preprint arXiv:2505.17534.
- D. P. Kingma and J. Ba (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J. Zhu (2023). Multi-concept customization of text-to-image diffusion. CVPR.
- P. Langley (2000). Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, CA, pp. 1207-1216.
- C. Li, Z. Leng, C. Yan, J. Shen, H. Wang, W. Mi, Y. Fei, X. Feng, S. Yan, H. Wang, L. Zhan, Y. Jia, P. Wu, and H. Sun (2023). ChatHaruhi: Reviving anime character in reality via large language model. arXiv preprint arXiv:2308.09597.
- J. Liu, G. Liu, J. Liang, Y. Li, J. Liu, X. Wang, P. Wan, D. Zhang, and W. Ouyang (2025). Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470.
- W. Mao, Z. Yang, and M. Z. Shou (2025). UniRL: Self-improving unified multimodal models via supervised and reinforcement learning. arXiv preprint arXiv:2505.23380.
- T. Nguyen, H. Liu, Y. Li, M. Cai, U. Ojha, and Y. J. Lee (2024). Yo'LLaVA: Your personalized language and vision assistant. NeurIPS.
- T. Nguyen, K. K. Singh, J. Shi, T. Bui, Y. J. Lee, and Y. Li (2025). Yo'Chameleon: Personalized vision and language generation. CVPR.
- Y. Oh, J. Mok, D. Chung, J. Shin, S. Park, J. Barthelemy, and S. Yoon (2025). RePIC: Reinforced post-training for personalizing multi-modal language models. arXiv preprint arXiv:2506.18369.
- OpenAI (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276.
- N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman (2022). DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242.
- Y. Shao, L. Li, J. Dai, and X. Qiu (2023). Character-LLM: A trainable agent for role-playing. EMNLP.
- Q. Shi, J. Bai, Z. Zhao, W. Chai, K. Yu, J. Wu, S. Song, Y. Tong, X. Li, X. Li, et al. (2025a). Muddit: Liberating generation beyond text-to-image with a unified discrete diffusion model. arXiv preprint arXiv:2505.23606.
- Q. Shi, L. Qi, J. Wu, J. Bai, J. Wang, Y. Tong, and X. Li (2025b). DreamRelation: Bridging customization and relation generation. In CVPR.
- Q. Shi, J. Wu, J. Bai, J. Zhang, L. Qi, X. Li, and Y. Tong (2025c). Decouple and track: Benchmarking and improving video diffusion transformers for motion transfer. In ICCV.
- Qwen Team (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, Y. Fan, K. Dang, M. Du, X. Ren, R. Men, D. Liu, C. Zhou, J. Zhou, and J. Lin (2024). Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
- X. Wang, H. Wang, Y. Zhang, X. Yuan, R. Xu, J. Huang, S. Yuan, H. Guo, J. Chen, W. Wang, Y. Xiao, and S. Zhou (2025). CoSER: Coordinating LLM-based persona simulation of established roles. arXiv preprint arXiv:2502.09082.
- Z. M. Wang, Z. Peng, H. Que, J. Liu, W. Zhou, Y. Wu, H. Guo, R. Gan, Z. Ni, J. Yang, M. Zhang, Z. Zhang, W. Ouyang, K. Xu, S. W. Huang, J. Fu, and J. Peng (2023). RoleLLM: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv:2310.00746.
- C. Wu, P. Zheng, R. Yan, S. Xiao, X. Luo, Y. Wang, W. Li, X. Jiang, Y. Liu, J. Zhou, Z. Liu, Z. Xia, C. Li, H. Deng, J. Wang, K. Luo, B. Zhang, D. Lian, X. Wang, Z. Wang, T. Huang, and Z. Liu (2025a). OmniGen2: Exploration to advanced multimodal generation. arXiv preprint arXiv:2506.18871.
- J. Wu, X. Li, H. Ding, X. Li, G. Cheng, Y. Tong, and C. C. Loy (2023). Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation. In ICCV.
- J. Wu, X. Li, C. Si, S. Zhou, J. Yang, J. Zhang, Y. Li, K. Chen, Y. Tong, Z. Liu, et al. (2024). Towards language-driven video inpainting via multimodal large language models. In CVPR.
- J. Wu, C. Tang, J. Wang, Y. Zeng, X. Li, and Y. Tong (2025b). DiffSensei: Bridging multi-modal LLMs and diffusion models for customized manga generation. In CVPR.
- J. Xie, W. Mao, Z. Bai, D. J. Zhang, W. Wang, K. Q. Lin, Y. Gu, Z. Chen, Z. Yang, and M. Z. Shou (2024). Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528.
- J. Xie, Z. Yang, and M. Z. Shou (2025). Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564.
- R. Xu, X. Wang, J. Chen, S. Yuan, X. Yuan, J. Liang, Z. Chen, X. Dong, and Y. Xiao (2024). Character is destiny: Can large language models simulate persona-driven decisions in role-playing? arXiv preprint arXiv:2404.12138.
- L. Yang, Y. Tian, B. Li, X. Zhang, K. Shen, Y. Tong, and M. Wang (2025). MMaDA: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809.
- H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang (2023). IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721.
- Y. Zeng, V. M. Patel, H. Wang, X. Huang, T. Wang, M. Liu, and Y. Balaji (2024). JeDi: Joint-image diffusion models for finetuning-free personalized text-to-image generation. CVPR.
- K. Zheng, H. Chen, H. Ye, H. Wang, Q. Zhang, K. Jiang, H. Su, S. Ermon, J. Zhu, and M. Liu (2025). DiffusionNFT: Online diffusion reinforcement with forward process. arXiv preprint arXiv:2509.16117.


## Appendix A Introduction Video

To help readers quickly grasp the primary idea of our work, we provide a 5\-minute introduction video\. Please refer to “introduction\_video\.mp4” in the supplementary file\.

## Appendix B Preliminaries

Unified Multimodal Architecture. We build our framework upon a pre-trained unified multimodal model capable of simultaneous image-text understanding and generation. For an input image $I$, the model extracts two distinct representations to support different task modalities: a semantic encoder $E_{\text{sem}}$ captures the high-level context and character attributes, denoted as $F_{\text{sem}} = E_{\text{sem}}(I)$, and a generation-oriented encoder $E_{\text{pix}}$ maps the image into pixel-level latent tokens for visual reconstruction, denoted as $Z_{\text{pix}} = E_{\text{pix}}(I)$. The model adopts a dual-paradigm generation approach: text sequences are produced via autoregressive next-token prediction, while images are synthesized through a rectified-flow-based generative process. This unified structure maintains cross-modal consistency by sharing a common latent space for both understanding and synthesis.
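As a minimal sketch of this dual-representation interface (the class and the encoder modules it wraps are illustrative placeholders, not the paper's implementation):

```python
import torch
import torch.nn as nn

class UnifiedBackboneSketch(nn.Module):
    """Illustrative wrapper around the two encoders described above."""

    def __init__(self, e_sem: nn.Module, e_pix: nn.Module):
        super().__init__()
        self.e_sem = e_sem  # semantic encoder: high-level context, character attributes
        self.e_pix = e_pix  # generation-oriented encoder: pixel-level latent tokens

    def encode(self, image: torch.Tensor):
        f_sem = self.e_sem(image)  # F_sem = E_sem(I), used for understanding
        z_pix = self.e_pix(image)  # Z_pix = E_pix(I), used for reconstruction/generation
        return f_sem, z_pix
```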

GRPO on Flow Matching. To optimize flow-based generative models, we treat the denoising process as an MDP and apply Group Relative Policy Optimization (GRPO), inspired by Flow-GRPO (Liu et al., [2025](https://arxiv.org/html/2605.08129#bib.bib64)). Unlike traditional methods that require a critic, GRPO estimates the advantage $\hat{A}_t^i$ by normalizing rewards across a group of $G$ trajectories sampled from the same prompt $c$:

$$\hat{A}_t^i = \frac{R(\boldsymbol{x}_0^i, c) - \operatorname{mean}\big(\{R(\boldsymbol{x}_0^i, c)\}_{i=1}^{G}\big)}{\operatorname{std}\big(\{R(\boldsymbol{x}_0^i, c)\}_{i=1}^{G}\big)} \tag{7}$$

The training objective $\mathcal{J}_{\text{GRPO}}(\theta)$ maximizes a clipped surrogate loss combined with a KL-divergence penalty to ensure policy stability:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{\boldsymbol{c}\sim\mathcal{C},\ \{\boldsymbol{x}^i\}_{i=1}^{G}\sim\pi_{\theta_{\text{old}}}(\cdot\mid\boldsymbol{c})}\, f(r, \hat{A}, \theta, \epsilon, \beta) \tag{8}$$

where

$$f(r,\hat{A},\theta,\epsilon,\beta) = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{T}\sum_{t=0}^{T-1}\Big(\min\big(r_t^i(\theta)\,\hat{A}_t^i,\ \operatorname{clip}\big(r_t^i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t^i\big) - \beta\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\Big)$$

and $r_t^i(\theta) = \dfrac{p_\theta(\boldsymbol{x}_{t-1}^i \mid \boldsymbol{x}_t^i, c)}{p_{\theta_{\text{old}}}(\boldsymbol{x}_{t-1}^i \mid \boldsymbol{x}_t^i, c)}$ is the importance sampling ratio. Specifically, we set $\beta = 0$, meaning there is no KL-divergence penalty.
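For concreteness, here is a minimal PyTorch sketch of Eq. (7) and the clipped objective of Eq. (8) under the setting $\beta = 0$; the function names, tensor shapes, and the assumption that per-step log-probabilities are precomputed are illustrative rather than the released training code.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (7): normalize rewards within a group of G trajectories.

    rewards: shape (G,), the scalar reward R(x_0^i, c) for each sampled image.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   rewards: torch.Tensor, eps_clip: float = 1e-5) -> torch.Tensor:
    """Eq. (8) with beta = 0 (clipping only, no KL penalty).

    logp_new, logp_old: shape (G, T), per-step log p(x_{t-1} | x_t, c)
    under the current and old policies, respectively.
    """
    adv = group_relative_advantages(rewards).unsqueeze(1)  # (G, 1), broadcast over T
    ratio = (logp_new - logp_old).exp()                    # importance ratio r_t^i(theta)
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    # Objective to maximize; negate it when using a gradient-descent optimizer.
    return torch.minimum(ratio * adv, clipped * adv).mean()
```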

![Qualitative ablation comparisons](https://arxiv.org/html/2605.08129v1/x6.png)
Figure 6: Qualitative comparisons of the ablation studies on the model training stage (w/o GRPO vs. w/ GRPO).
## Appendix C More Qualitative Results

Due to the large number of qualitative results, we present them on a separate, anonymized local HTML page. Please refer to "index.html" in the supplementary file. On this project page, we provide extensive qualitative results across various tasks and characters.

## Appendix D More Ablation Studies

Here, we present a qualitative result demonstrating the performance gains delivered by the Character-GRPO training stage. The results are shown in [Figure 6](https://arxiv.org/html/2605.08129#A2.F6). Models trained with GRPO achieve superior performance in both text-image alignment and image diversity.

We also conducted two more ablation studies to provide insights into the impact of training data composition and inference strategies on model performance\.

In[Table6](https://arxiv.org/html/2605.08129#A4.T6), introducing Extension Dialogues improves text\-based metrics—particularly Memorization, Personality, and Diversity—by reducing repetitive responses in role\-play\. However, this change alters the balance between textual and visual training data, leading to a slight degradation in image\-related tasks \(T2I and Multimodal Role\-Play\) as measured by CLIP\-I, CLIP\-T, and DINO, along with increased Trainset Similarity scores \(indicating higher similarity to training data\)\. Adding the Thinking Process to the training data mitigates this issue: while Memorization slightly decreases compared to using only Extension Dialogues, Personality and Diversity remain high, and image quality metrics across both T2I and Multimodal Role\-Play consistently improve, surpassing even the Original setting\.

In[Table7](https://arxiv.org/html/2605.08129#A4.T7), when all models are trained with the same data \(“\+Extension Dialogues \+ Thinking Process”\), the effect of using the thinking process during inference depends on the training stage\. For models trained only with Unified\-SFT, enabling the thinking process at inference yields no improvement in Multimodal Role\-Play performance; CLIP\-I and DINO scores slightly decrease, and Trainset Similarity increases\. In contrast, for models trained with both Unified\-SFT and Character\-GRPO, applying the thinking process during inference leads to consistent gains in CLIP\-I, CLIP\-T, and DINO, with a marginal reduction in Trainset Similarity under the DINO metric\. This suggests that the benefits of inference\-time reasoning are contingent on the inclusion of the Character\-GRPO training stage\.

Table 6: Average performance across several characters under different training data settings. Higher values are better except for Trainset Similarity (lower is better, marked ↓). The settings "Original", "+Extension Dialogues", and "+Extension Dialogues + Thinking Data" differ only in training data composition and do not employ the Character-GRPO stage. "+Extension Dialogues + Thinking Data" uses the thinking process during inference. Memorization/Personality/Diversity are the Text-based Role-Play metrics; MRP = Multimodal Role-Play; TS = Trainset Similarity. Best values per column are in bold (the original table also highlights second-best).

| Setting | Memorization | Personality | Diversity | T2I CLIP-I | T2I CLIP-T | T2I DINO | MRP CLIP-I | MRP CLIP-T | MRP DINO | TS-T2I CLIP-I↓ | TS-T2I DINO↓ | TS-MRP CLIP-I↓ | TS-MRP DINO↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original | 4.410 | 5.179 | 2.667 | 0.851 | 0.278 | 0.879 | 0.843 | 0.283 | 0.873 | 0.918 | 0.933 | **0.916** | **0.934** |
| +Extension Dialogues | **5.494** | 6.706 | 6.323 | 0.845 | 0.280 | 0.873 | 0.842 | 0.277 | 0.872 | **0.916** | **0.930** | 0.919 | 0.935 |
| +Extension Dialogues + Thinking Data | 5.192 | **6.783** | **6.333** | **0.858** | **0.284** | **0.883** | **0.863** | **0.287** | **0.888** | 0.921 | 0.942 | 0.932 | 0.943 |

Table 7: Average performance across all characters under different inference settings for different models. Higher values are better except for Trainset Similarity (lower is better, marked ↓). "SFT" and "SFT+GRPO" refer to training-stage settings, while "w/o thinking" and "w/ thinking" refer to inference settings. The training data setting is fixed to "+Extension Dialogues + Thinking Data". CLIP-I/CLIP-T/DINO are Multimodal Role-Play metrics; TS = Trainset Similarity (Multimodal Role-Play). The original table marks improvements in green and regressions in red.

| Setting | CLIP-I | CLIP-T | DINO | TS CLIP-I↓ | TS DINO↓ |
|---|---|---|---|---|---|
| SFT (w/o thinking) | 0.827 | 0.304 | 0.859 | 0.895 | 0.921 |
| SFT (w/ thinking) | 0.825 | 0.306 | 0.853 | 0.903 | 0.924 |
| SFT+GRPO (w/o thinking) | 0.856 | 0.322 | 0.884 | 0.864 | 0.895 |
| SFT+GRPO (w/ thinking) | 0.860 | 0.326 | 0.886 | 0.869 | 0.893 |

## Appendix E User Studies

To conduct a comprehensive qualitative assessment of model performance, we performed a subjective user study comparing UniCharacter against the DreamBooth (Ruiz et al., [2022](https://arxiv.org/html/2605.08129#bib.bib22)), UniCTokens (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)), and Qwen2.5-VL (Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53)) baselines. For T2I generation, multimodal role-play, and text role-play, we sampled 3–5 generated outputs from UniCharacter and each baseline. For the customized T2I generation task, evaluation criteria included the quality of generated images and alignment with both text and characters. For the multimodal role-play task, evaluation criteria encompassed the quality of character responses, the quality of generated images, and the degree of alignment between text and images. For the text role-play task, evaluation criteria focused on the alignment of responses with the input scenario, character personality, and linguistic style. Results are presented in [Figure 7](https://arxiv.org/html/2605.08129#A5.F7). Our model outperformed the baselines across all three tasks, consistent with the findings of the previous quantitative results.

![User study results across tasks](https://arxiv.org/html/2605.08129v1/figs/user_study.png)
Figure 7: User study across tasks. We conducted a user study on four methods (DreamBooth, Qwen2.5-VL, UniCTokens, and UniCharacter) across three tasks: T2I generation, multimodal role-play, and text role-play. The width of all three bar charts is uniformly set to 100%. Different colors represent different models, with percentages labeled on the corresponding bars. UniCharacter outperformed the baselines across all tasks.
## Appendix F Detailed Data Construction Pipeline

### F.1 Data Collection

Our comprehensive list of collected characters includes:

- Human: Adrien Brody, Coco, Friends-Chandler, Friends-Joey, Gao Qiqiang, Harry Potter-Hermione, Leonardo, Will In Vietnam, Wukong
- Animal: Bo, Butin, Mam, Mydieu
- Anime Character: Genshin-Furina, Mahjong Soul-Ichihime, Mahjong Soul-Miki Nikaidou, Mahjong Soul-Rin Tohsaka, Mahjong Soul-Saber, Mahjong Soul-YuiYagi, Pokemon-Pikachu

An overview of our RoleScape-20 dataset is shown in [Figure 8](https://arxiv.org/html/2605.08129#A6.F8). Images for Adrien Brody, Coco, Will In Vietnam, Bo, Butin, Mam, and Mydieu are sourced from UnifyBench (An et al., [2025](https://arxiv.org/html/2605.08129#bib.bib51)); images for all other characters are collected and extracted by us from the internet, films, TV series, games, and anime.

While gathering these images, we also collect corresponding character profiles and example dialogues from Wikipedia and other online sources. For characters lacking sufficient dialogue samples, we generate annotated dialogues using Gemini 2.5 Pro (Gemini Team, [2025](https://arxiv.org/html/2605.08129#bib.bib57)) via its web interface, followed by manual review and refinement.

![Overview of the RoleScape-20 dataset](https://arxiv.org/html/2605.08129v1/x7.png)
Figure 8: An overview of RoleScape-20. RoleScape-20 contains 9 human characters, 4 animal characters, and 7 anime characters.
### F.2 Data Annotation

Dialogue extension. Based on the provided example resources, we extend the dialogues: starting from approximately 10 example dialogues and character profiles, we generate around 200 extended dialogues, a subset of which is used as our Diversity test set. We use the Qwen3 (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52)) large language model (LLM) as the automatic tool for dialogue expansion. The prompt format we provide to the LLM is displayed in [Table 8](https://arxiv.org/html/2605.08129#A6.T8).

Multi-modal Role-Play data annotation. Additionally, to construct our Multimodal Role-Play data, we annotate each image-dialogue pair with a reasoning process to guide generation and a standardized generation instruction, referred to in the main text as the "Thinking Process" and "Generation Instruction", respectively. We use GPT-4o (OpenAI, [2024](https://arxiv.org/html/2605.08129#bib.bib56)) as the annotation tool for this task. The prompt format provided to the model is displayed in [Table 9](https://arxiv.org/html/2605.08129#A6.T9).

Test Questions for Personality & Memorization. We also curate a separate set of test questions specifically designed to evaluate the model's Personality and Memorization capabilities. These specialized test questions are more aligned with our evaluation objectives than general-purpose questions. For each of these two capabilities, we annotate approximately 20 test questions. We use the Qwen3 (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52)) LLM as the annotation tool. The prompts displayed in [Tables 10](https://arxiv.org/html/2605.08129#A6.T10) and [11](https://arxiv.org/html/2605.08129#A6.T11) are provided to the model.

QA data. For the Knowledge QA and VQA tasks, we construct separate training and test sets. For Knowledge QA, we create approximately 100 QA-formatted training samples and around 10 multiple-choice test questions per character. For VQA, we generate approximately 20 QA-formatted training samples and 5 multiple-choice test questions per character image. We use Qwen3 (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52)) and Qwen3-VL (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52); Bai et al., [2023](https://arxiv.org/html/2605.08129#bib.bib55); Wang et al., [2024](https://arxiv.org/html/2605.08129#bib.bib54); Bai et al., [2025](https://arxiv.org/html/2605.08129#bib.bib53)) to annotate the Knowledge QA and VQA data, respectively. The annotation prompts for the trainset and testset of Knowledge QA are displayed in [Tables 12](https://arxiv.org/html/2605.08129#A6.T12) and [13](https://arxiv.org/html/2605.08129#A6.T13), and the annotation prompts for the trainset and testset of VQA are displayed in [Tables 14](https://arxiv.org/html/2605.08129#A6.T14) and [15](https://arxiv.org/html/2605.08129#A6.T15).
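For reference, the per-character annotation budget described in this subsection can be summarized as follows; the dictionary layout is illustrative and not the released dataset format.

```python
# Approximate per-character data budget implied by Appendix F.2
# (keys are illustrative; counts are the "approximately" values from the text).
PER_CHARACTER_BUDGET = {
    "seed_example_dialogues": 10,       # provided with each character
    "extended_dialogues": 200,          # generated with Qwen3 (Table 8)
    "knowledge_qa_train": 100,          # QA-formatted samples (Table 12)
    "knowledge_qa_test_mcq": 10,        # multiple-choice questions (Table 13)
    "vqa_train_per_image": 20,          # QA pairs per image (Table 14)
    "vqa_test_mcq_per_image": 5,        # MCQs per image (Table 15)
    "personality_test_questions": 20,   # Table 10
    "memorization_test_questions": 20,  # Table 11
}
```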

Table 8: Prompt for Qwen3 LLM to extend dialogues.

Prompt for Dialogues Extension

Please refer to the following information regarding \[Character Name\], along with 10 dialogue entries containing Character Descriptions and Openings \(each dialogue entry includes User Input and Character Response\)\. Generate 200 expanded dialogues that remain consistent with the character’s personality and linguistic style\. You may independently consider suitable aspects for the content, such as the character’s basic information, gameplay mechanics, questions about the character, and everyday chatter\.

Information regarding \[Character Name\] is as follows:

\[Character Profile\]

The following ten dialogue entries, complete with Character Description and Opening, are as follows:

\[Example Dialogues\]

Please format the generated extended dialogue as follows:

```
{
  "description": "",
  "opening": "",
  "scenes": [
    [
      {
        "role": "user",
        "text": "Hello."
      },
      {
        "role": "machine",
        "text": "Hi."
      }
    ]
  ]
}
```
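The paper does not describe a validator for this format, but a light check of the expected structure might look like the following sketch (the function name is illustrative).

```python
def is_valid_dialogue(record: dict) -> bool:
    """Check that a generated record matches the template above (illustrative)."""
    if not all(key in record for key in ("description", "opening", "scenes")):
        return False
    for scene in record["scenes"]:
        for turn in scene:
            if turn.get("role") not in ("user", "machine") or "text" not in turn:
                return False
    return True

# A record shaped like the template passes the check.
sample = {
    "description": "",
    "opening": "",
    "scenes": [[{"role": "user", "text": "Hello."},
                {"role": "machine", "text": "Hi."}]],
}
assert is_valid_dialogue(sample)
```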

Table 9: Prompt for GPT-4o to annotate thinking processes and generation instructions. [Character Content] refers to Character Description + User Input + Character Response.

Prompt for Thinking Process Annotation

You are an expert AI assistant specializing in generating creative prompts for text\-to\-image models\. Your task is to analyze a scene description and its corresponding visual representation \(an image\) to create a new, detailed annotation\.

This new annotation must contain two parts:

1. `thinking_process`: A detailed, step-by-step explanation of how the provided image successfully visualizes the scene described in the text. You must break down your reasoning, connecting the character's personality, user input, and character response to the specific visual elements present in the image. Explain the character's expression, pose, the setting's composition, and the overall mood as depicted in the image, and justify why these artistic choices are effective.

2. `generation_instruction`: A concise and effective text-to-image prompt that would generate an image as similar as possible to the one provided. It should be a single string of comma-separated keywords and phrases that accurately captures the image's subject, art style, composition, lighting, and key details. The generation instruction should not be longer than 55 words.

Here is an example of the complete task:

\[Annotation Example\]

Remember, the generation\_instruction should be concise and not exceed 55 words\.

Now, please perform the same task for the following data:

\[INPUT DATA\]

```
{{
  "text": "[Character Content]",
  "character": "[Character Name]",
  "image": <image>
}}
```

\[YOUR ANNOTATION\]

```
{{
  "thinking_process": "...",
  "generation_instruction": "..."
}}
```

Table 10: Prompt for Qwen3 LLM to annotate personality questions.

Prompt for Personality Questions Annotation

Referencing dialogues and materials from the training set about \[Character Name\], generate 20 user inputs as a test set\. This will evaluate whether the model’s personality aligns with the original character\.

Materials concerning \[Character Name\] are as follows: \[Character Profile\]

Some of the dialogue from the training set is as follows: \[Example Dialogues\]

Personality: The model should mimic the way the character would think or speak, including speaking style and tone, as well as emotions and reactions under different circumstances\.

The generated question asks whether our model’s personality is similar to that of the original character \[Character Name\]\.

Table 11: Prompt for Qwen3 LLM to annotate memorization questions.

Prompt for Memorization Questions Annotation

Referencing dialogues and materials from the training set about \[Character Name\], generate 20 user inputs as a test set\. This will evaluate the model’s ability to memorize and accurately reflect the original character\.

Materials concerning \[Character Name\] are as follows: \[Character Profile\]

Some of the dialogue from the training set is as follows: \[Example Dialogues\]

Memorization: The model’s ability to recall relevant information about the character being portrayed, including precise and detailed knowledge about people, events, and objects associated with the role\.

The generated questions should be suitable for testing our model’s memorization of the character \[Character Name\]\.

Table 12: Prompt for Qwen3 LLM to annotate the Knowledge QA trainset.

Prompt for Knowledge QA Trainset Annotation

You are an expert data constructor for QA datasets\.

Goal

\- From the PROFILE\_TEXT below, create exactly 100 diverse, non\-overlapping QA items that teach factual knowledge about the character \[Character Name\] described by the profile\.

Source

\- PROFILE\_TEXT:

\[Character Profile\]

Output format

\- Produce a JSON array with 100 objects\.

\- Each object must follow this schema:

```
{
  "id": "",
  "question": "",
  "answer": "",
  "evidence": "",
  "type": "",
  "difficulty": "<easy | medium | hard>",
  "reasoning_steps": "",
  "confidence": "<0.01.0>"
}
```

Constraints and guidelines

\- Grounding: Do not invent or guess\. All answers must be fully supported by PROFILE\_TEXT\. If something is not stated, do not include a QA about it\.

\- Coverage: Cover all major aspects present in PROFILE\_TEXT \(background, roles, projects, timelines, metrics, skills, tools, publications, awards, locations, preferences, constraints, aspirations\)\.

\- Diversity: Vary question forms \(wh\-, yes/no, how/why, compare/contrast, list, numeric\)\. Mix difficulties \(about 40 easy, 40 medium, 20 hard\)\.

\- Granularity: Balance broad and fine\-grained details\. Prefer atomic facts over vague summaries\.

\- Non\-duplication: No two questions may target the same fact phrased differently\. Each QA must add unique value\.

\- Clarity: Questions must be self\-contained, unambiguous, and use the same terminology as PROFILE\_TEXT where possible\.

\- Answers: Prefer short spans \(1–30 words\) unless a list or procedure is required\. For lists, keep up to 5 key items unless the text explicitly provides a longer enumerated list\.

\- Evidence: Quote minimal verbatim spans \(one or more\) that directly support the answer\. Do not paraphrase in “evidence”\.

\- Reasoning: When an answer requires synthesis \(e\.g\., computing age from years, comparing statements\), briefly outline the steps in “reasoning\_steps”\.

\- Temporal correctness: Respect dates and tenses exactly as written; do not extrapolate current status unless explicitly stated\.

\- Sensitive data: Do not include personal data beyond what is explicitly present in PROFILE\_TEXT\.

\- Consistency: Use consistent units, names, and abbreviations as in PROFILE\_TEXT\.

\- IDs: Use stable, human\-readable IDs like “qa\_001” … “qa\_100”\.

Validation

\- Ensure the JSON is valid and contains exactly 100 items\.

\- Each “evidence” string must be a verbatim substring of PROFILE\_TEXT\.

\- No item may have empty fields\. No hallucinations\.

Deliverable

\- Return only the JSON array, nothing else\.

\- Ensure that the JSON is valid and contains exactly 100 items\.

Table 13: Prompt for Qwen3 LLM to annotate the Knowledge QA testset.

Prompt for Knowledge QA Testset Annotation

You are an expert author of knowledge\-based multiple\-choice QA \(MCQ\) items\. Your expertise lies in creating questions that assess factual knowledge about a character, with precise answers grounded strictly in the provided profile text\.

Task

\- Given the PROFILE\_TEXT below about the character \[Character Name\], you MUST generate exactly 10 single\-answer MCQ items that test factual knowledge about this character\.

Source

\- PROFILE\_TEXT:

\[Character Profile\]

Grounding Rules

\- Profile Evidence Only: Use ONLY explicitly stated information in PROFILE\_TEXT\. Do not rely on any outside knowledge\.

\- No Inference: Do not infer or extrapolate beyond what is directly stated in the profile\.

\- Clarity: If a detail is not clearly stated in the profile, do not ask about it\.

\- Accuracy: Every correct answer must be directly verifiable from the PROFILE\_TEXT\.

Coverage and Diversity \(10 questions total\)

\- You MUST include a mix of questions covering these aspects:

1\. Background/Origin: e\.g\., character’s background, species, role, etc\.

2\. Personality/Traits: Character’s defining personality characteristics\.

3\. Abilities/Skills: What the character can do or is known for\.

4\. Relationships: Connections to other characters, places, or things\.

5\. Preferences/Habits: What the character likes, dislikes, or commonly does\.

6\. Appearance: Physical characteristics or distinctive features\.

7\. History/Timeline: Past events or milestones in the character’s life\.

8\. Quotes/Speech: Notable phrases or speaking patterns\.

9\. Goals/Motivations: What drives the character\.

10\. Miscellaneous Facts: Other unique or interesting details\.

\- Difficulty Mix: The set of 10 questions MUST have this exact distribution: 4 easy, 4 medium, 2 hard\.

Question and Options Requirements

\- Clarity: Each question must be clear and unambiguous\.

\- Format: Provide exactly 4 options: A, B, C, D\. Only ONE option can be correct\.

\- Plausible Distractors: Options must be mutually exclusive\. Incorrect options \(distractors\) should be plausible but clearly wrong based on the PROFILE\_TEXT\. Avoid giveaway options\.

\- Consistency: Use consistent terminology as in PROFILE\_TEXT\.

\- Negative Phrasing: Avoid “All of the above” or “None of the above” unless it is genuinely the correct answer and used at most once\.

Output Format

\- Your entire response MUST be a single, valid JSON array containing 10 question objects\. Do not include any text or explanation outside of the JSON array\.

\- Adhere strictly to this schema for each object in the array:

```
{
  "id": "mcq_001" | "mcq_002" | ... | "mcq_010",
  "question": "<Question grounded in PROFILE_TEXT>",
  "options": {
    "A": "<option A>",
    "B": "<option B>",
    "C": "<option C>",
    "D": "<option D>"
  },
  "answer_key": "A" | "B" | "C" | "D",
  "difficulty": "easy" | "medium" | "hard",
  "evidence": "<Text that supports the correct answer>",
  "rationale": ""
}
```

Validation

\- Ensure the JSON is valid and contains exactly 10 items\.

\- Each “evidence” string must be a verbatim substring of PROFILE\_TEXT\.

\- No item may have empty fields\.

\- IDs must be “mcq\_001” through “mcq\_010”\.

Deliverable

\- Return only the JSON array, nothing else\.

\- Ensure that the JSON is valid and contains exactly 10 items with the correct difficulty distribution \(4 easy, 4 medium, 2 hard\)\.

Table 14: Prompt for Qwen3-VL to annotate the VQA trainset.

Prompt for VQA Trainset Annotation

You are an AI assistant specializing in creating training data for multimodal AI\. Your task is to generate 20 question\-and\-answer pairs based on the three pieces of information provided below: an image, a dialogue, and a character profile\.

The questions should primarily test understanding of the visual information in the image\. The answers should be accurate and derived from the provided context\.

Input Data:

1\. Character Profile:

\[Character Profile\]

2\. Character Image:

<image>

3\. Dialogue:

User Input: \[user\_input\]

Character Response: \[character\_response\]

Task Requirements:

Generate a total of 20 question\-and\-answer pairs for \[Character Name\]\. The questions should be answerable by referencing the \[Character Profile\] in conjunction with the \[Character Image\] and \[Dialogue\]\.

The set of 20 QA pairs must follow this distribution:

\- 15 Objective Questions: These questions should focus on factual, directly observable details in the image\.

\- 5 Subjective Questions: These questions should require inference about the character’s emotional state, thoughts, or intentions, based on interpreting the visual cues within the given context\.

Formatting:

Please format your output as follows\.

```
{{
  "qa_pairs": [
    {{
      "question": "",
      "answer": "",
      "type": "objective"
    }}
  ]
}}
```

Table 15: Prompt for Qwen3-VL to annotate the VQA testset.

Prompt for VQA Testset Annotation

You are an expert author of visual multiple\-choice QA \(MCQ\) items\. Your expertise lies in creating questions that are precise, fair, and grounded strictly in visual data, with plausible yet incorrect distractors\.

Task

\- Given ONE character image, you MUST generate exactly 5 single\-answer MCQ items that assess understanding of the specific character as depicted in THIS image\.

Grounding Rules

\- Visual Evidence Only: Use ONLY explicitly visible evidence in the image\. Do not rely on any outside knowledge of the IP/character\.

\- No Inference: Do not infer relationships, past events, or future actions\. Base all answers strictly on the static image provided\.

\- Legibility: Avoid reading tiny or unreadable text\. Only use text that is clearly legible\.

\- Clarity: If a detail is not clearly and unambiguously visible, do not ask about it\.

Coverage and Diversity \(5 questions total\)

\- You MUST include a mix of questions covering these aspects of the depicted character:

1\. Appearance/Attributes: e\.g\., hair, clothing, color, accessories\.

2\. Action/Pose/Gesture: The character’s current physical stance or action\.

3\. Expression/Emotion: The facial expression or implied emotion\.

4\. Context/Relevant Element: A background or foreground element directly interacting with or framing the character\.

5\. Fine Detail/Reasoning: A question requiring counting, relative positioning, or identifying a small, specific detail\.

\- Difficulty Mix: The set of 5 questions MUST have this exact distribution: 2 easy, 2 medium, 1 hard\.

Question and Options Requirements

\- Clarity: Each question must be clear and unambiguous\.

\- Format: Provide exactly 4 options: A, B, C, D\. Only ONE option can be correct\.

- Plausible Distractors: Options must be mutually exclusive. Incorrect options (distractors) should be plausible but clearly wrong based on visual evidence. Avoid giveaway options like "All of the above."

- Negative Phrasing: Use "None of the above" or "Cannot be determined" at most once across all 5 questions, and only if it is the genuinely correct answer.

Output Format

\- Your entire response MUST be a single, valid JSON array containing 5 question objects\. Do not include any text or explanation outside of the JSON array\.

\- Adhere strictly to this schema for each object in the array:

```
{
    "id": "q1" | "q2" | "q3" | "q4" | "q5",
    "question": "<Question grounded in the image>",
    "options": {
      "A": "<option A>",
      "B": "<option B>",
      "C": "<option C>",
      "D": "<option D>"
    },
    "answer_key": "A" | "B" | "C" | "D",
    "difficulty": "easy" | "medium" | "hard",
    "rationale": ""
  }
```

## Appendix G Detailed Experiment Setup

### G.1 More Implementation Details

For the unified SFT stage, we set the training sampling ratio for text-to-image (T2I) and image understanding tasks to 200:1, while ensuring that each batch includes at least one sample from each of the T2I and VLM datasets. The model is optimized with AdamW (Kingma and Ba, [2014](https://arxiv.org/html/2605.08129#bib.bib32)) at a learning rate of 2e-5, without warmup.
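A minimal sketch of this batch-composition rule, assuming list-like datasets; the helper name and the probabilistic application of the 200:1 ratio are illustrative, not the paper's data loader.

```python
import random

def sample_unified_sft_batch(t2i_data: list, vlm_data: list, batch_size: int) -> list:
    """Compose a batch with a ~200:1 T2I-to-VLM sampling ratio while
    guaranteeing at least one sample of each task type (illustrative sketch)."""
    assert batch_size >= 2
    batch = [random.choice(t2i_data), random.choice(vlm_data)]  # one of each, guaranteed
    for _ in range(batch_size - 2):
        pool = t2i_data if random.random() < 200 / 201 else vlm_data
        batch.append(random.choice(pool))
    random.shuffle(batch)
    return batch
```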

In the Character-GRPO stage, we utilize the Group Relative Policy Optimization (GRPO) algorithm adapted for flow-matching models. For each input prompt, we sample a group of $G=8$ images to calculate the relative rewards. We employ a relatively conservative optimization strategy with a learning rate of 1e-5 and a batch size of 6. Notably, we set the KL divergence coefficient $\beta$ to 0, relying instead on a strict clipping mechanism to maintain training stability. The clipping range parameters, $\epsilon_{lt}$ and $\epsilon_{gt}$, are both set to $1\times10^{-5}$, ensuring that the policy updates remain within a very narrow trust region. Training is conducted in BFloat16 mixed precision to balance numerical stability and computational efficiency.

For the BAGEL model, we use different sampling step counts for training and evaluation: $N_{\text{train}}=15$ steps for efficient reward computation during the GRPO loop, and $N_{\text{eval}}=50$ steps for high-quality final synthesis. We apply a guidance scale of 4.0. To enhance stochasticity and structural preservation during personalization, we incorporate Stochastic Differential Equation (SDE) sampling with a noise level of 1.3. We conduct full-parameter tuning and set the SDE window size to 3. The SDE window is restricted to the first half of the denoising process ($[0, \lfloor N/2 \rfloor]$) to maintain global structure while allowing fine-grained detail refinement.
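The Character-GRPO hyperparameters from the two paragraphs above can be collected into a single configuration; the key names below are illustrative, not those of the released code.

```python
# Hypothetical consolidation of the Character-GRPO settings described above.
CHARACTER_GRPO_CONFIG = {
    "group_size": 8,            # G images sampled per prompt
    "learning_rate": 1e-5,
    "batch_size": 6,
    "kl_beta": 0.0,             # no KL penalty; stability via clipping only
    "clip_eps": 1e-5,           # epsilon_lt = epsilon_gt
    "precision": "bfloat16",
    "steps_train": 15,          # N_train sampling steps inside the GRPO loop
    "steps_eval": 50,           # N_eval sampling steps for final synthesis
    "guidance_scale": 4.0,
    "sde_noise_level": 1.3,
    "sde_window_size": 3,
    "sde_window_first_half": True,  # window restricted to [0, floor(N/2)]
}
```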

The reward signal $R$ is a multi-objective weighted sum designed to balance aesthetic quality, text alignment, and identity preservation. The components are defined as follows (a sketch of the combined reward appears after the list):

- CLIP Similarity Reward ($w=0.45$): measured by the CLIP-T score to ensure the generated image matches the prompt.
- VQA Consistency Reward ($w=0.3$): assessed using a VQA-based scorer.
- Perceptual Diversity Reward ($w=0.1$): measured via LPIPS distance within the group of $G$ images to prevent mode collapse.
- Trainset Similarity Penalty ($w=0.15$): a DINO-based penalty to prevent overfitting. Specifically, if the DINO similarity exceeds a high threshold (0.9) or falls below a low threshold (0.5), a penalty is applied. This encourages the model to maintain the target identity without generating near-identical replicas or losing the character's essence.
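A sketch of the weighted reward described above; the scorer inputs are assumed to be precomputed, and the linear form of the out-of-band DINO penalty is an assumption, since the text specifies only the thresholds.

```python
def character_grpo_reward(clip_t: float, vqa_score: float,
                          lpips_diversity: float, dino_sim: float) -> float:
    """Multi-objective reward R with the weights from Appendix G.1.

    clip_t:          CLIP-T similarity between prompt and generated image
    vqa_score:       VQA-based text-image consistency score
    lpips_diversity: mean LPIPS distance within the group of G images
    dino_sim:        DINO similarity between the sample and the training set
    """
    # Penalize DINO similarity outside [0.5, 0.9]: too high means a near-replica,
    # too low means losing the character's identity. The linear ramp is an assumption.
    if dino_sim > 0.9:
        penalty = dino_sim - 0.9
    elif dino_sim < 0.5:
        penalty = 0.5 - dino_sim
    else:
        penalty = 0.0
    return 0.45 * clip_t + 0.3 * vqa_score + 0.1 * lpips_diversity - 0.15 * penalty
```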

### G.2 LLM-as-Judge

For the Text-based Role-Play task, we employ an "LLM-as-Judge" methodology for comparison. Specifically, we annotate a specialized set of test questions focused on "Memorization" and "Personality." Memorization refers to the model's ability to accurately and comprehensively recall information pertinent to the character it is portraying, including associated people, events, and objects. Personality assesses the model's capacity to emulate the character's thought processes and speech patterns, including linguistic style, tone, and emotional expression across various contexts. However, we observe that models trained without extended dialogues tend to replicate template-like responses from the given examples. Although such outputs may align with the character's Personality, this is not an ideal outcome. To quantify this capability, we provide the LLM with 20 sets of user inputs and model responses, asking it to assign a Diversity score that reflects the model's ability to produce varied, non-repetitive replies.

Concretely, we use the Qwen3 (Team, [2025](https://arxiv.org/html/2605.08129#bib.bib52)) LLM to evaluate the model's role-play text outputs on the Personality, Memorization, and Diversity metrics.

For Personality and Memorization, each test question corresponds to a User Input–Model Response pair, yielding an individual score; the final score for each metric is the average across all test questions\.

For Diversity, the 20 test questions’ User Input–Model Response pairs are combined into a more comprehensive set of model–user interactions, which are jointly evaluated to produce a single Diversity score\.

All three scores range from 1 to 7\.
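Putting the three protocols together, the score aggregation reduces to the following sketch (function and argument names are illustrative).

```python
def aggregate_judge_scores(personality: list[float], memorization: list[float],
                           diversity_score: float) -> dict:
    """Aggregate LLM-as-Judge outputs as described above.

    personality, memorization: per-question scores in [1, 7], one per test question.
    diversity_score: a single score in [1, 7] over the combined 20 interactions.
    """
    return {
        "Personality": sum(personality) / len(personality),
        "Memorization": sum(memorization) / len(memorization),
        "Diversity": diversity_score,
    }
```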

Following Character-LLM (Shao et al., [2023](https://arxiv.org/html/2605.08129#bib.bib11)), our evaluation prompts are displayed in [Tables 16](https://arxiv.org/html/2605.08129#A7.T16), [17](https://arxiv.org/html/2605.08129#A7.T17), and [18](https://arxiv.org/html/2605.08129#A7.T18).

Table 16: Prompt for Personality Metrics.

Prompt for Personality Metrics

You will be given responses written by an AI assistant mimicking the character \[Character Name\]\. Your task is to rate the performance of \[Character Name\] using the specific criterion by following the evaluation steps\. Below is the data:

\*\*\*

Profile:

\[Character Profile\]

Example Dialogues:

\[Example Dialogues\]

\*\*\*

Interactions:

\[Interactions\]

\*\*\*

Evaluation Criterion:

Personality (1-7): Does the response reflect the personalities and preferences of the character?

\[Evaluation Steps\]

1\. Read through the profile, background, example dialogues, and write the personalities and preferences of the real character\.

2\. Read through the interactions and identify the personalities and preferences of the AI assistant\.

3\. After having a clear understanding of the interactions, compare the responses to the profile\. Look for any consistencies or inconsistencies\. Do the responses reflect the character’s personalities and preferences?

4\. Use the given scale from 1\-7 to rate how well the response reflects the personalities and preferences of the character\. 1 being not at all reflective of the character’s personalities, and 7 being perfectly reflective of the character’s personalities\.

\*\*\*

First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct\. Avoid simply stating the correct answers at the outset\. Then print the score on its own line corresponding to the correct answer\. At the end, repeat just the selected score again by itself on a new line\.

Table 17: Prompt for Memorization Metrics.

Prompt for Memorization Metrics

You will be given responses written by an AI assistant mimicking the character \[Character Name\]\. Your task is to rate the performance of \[Character Name\] using the specific criterion by following the evaluation steps\. Below is the data:

\*\*\*

Profile:

\[Character Profile\]

\*\*\*

Interactions:

\[Interactions\]

\*\*\*

Evaluation Criterion:

Factual Correctness (1-7): Does the response provide truthful and detailed facts about the character?

\[Evaluation Steps\]

1\. Read through the profile and interaction and identify the key points related to the character\.

2\. Read through the responses of the AI assistant and compare them to the profile\. Check if the responses are consistent with the character’s profile, background, and known facts about the character\.

3\. Check whether the responses provide detailed facts about the character or if they are generic responses that could apply to any character\. Detailed responses are more factual and contribute positively to the score\.

4\. Rate the performance of the AI on a scale of 1\-7 for factual correctness, where 1 is the lowest and 7 is the highest based on the Evaluation Criteria\.

\*\*\*

First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct\. Avoid simply stating the correct answers at the outset\. Then print the score on its own line corresponding to the correct answer\. At the end, repeat just the selected score again by itself on a new line\.

Table 18: Prompt for Diversity Metrics.

Prompt for Diversity Metrics

You will be given multiple responses written by an AI assistant to different user inputs\. Your task is to rate the performance of the AI assistant on the diversity of its responses by following the evaluation steps\. Below is the data:

\*\*\*

Interactions:

[Interactions]

\*\*\*

\[Evaluation Criterion\]

Diversity \(1\-7\): How varied and non\-repetitive are the AI’s responses across different interactions?

\[Evaluation Steps\]

1\. Read through all the provided user inputs and the corresponding model responses\.

2\. Analyze the language, structure, and content of the model’s responses across the different interactions\. Identify any recurring phrases, sentence structures, or response templates\.

3\. Assess the degree of repetition\. Does the assistant provide unique, context\-specific answers for each input, or does it rely on a limited set of formulaic responses?

4\. Use the given scale from 1\-7 to rate the diversity of the responses\. 1 means the responses are highly repetitive and templated, while 7 means the responses are highly varied, creative, and tailored to each specific user input\.

\*\*\*

First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct\. Avoid simply stating the correct answers at the outset\. Then print the score on its own line corresponding to the correct answer\. At the end, repeat just the selected score again by itself on a new line\.
