# VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
###### Abstract
Human speech conveys expressiveness beyond linguistic content, including personality, mood, or performance elements, such as a comforting tone or humming a song, which we formalize as role-playing and singing. We present VITA-QinYu, the *first* expressive end-to-end (E2E) spoken language model (SLM) that goes beyond natural conversation to support both role-playing and singing generation. VITA-QinYu adopts a hybrid speech–text paradigm that extends interleaved text–audio modeling with multi-codebook audio tokens, a design enabling richer paralinguistic representation while preserving a clear separation between modalities to avoid interference. We further develop a comprehensive data generation pipeline to synthesize a total of 15.8K hours of natural conversation, role-playing, and singing data for training. VITA-QinYu demonstrates superior expressiveness, outperforming peer SLMs by 7 percentage points on objective role-playing benchmarks and surpassing peer models by 0.13 points on a 5-point MOS scale for singing. Simultaneously, it achieves state-of-the-art conversational accuracy and fluency, exceeding prior SLMs by 1.38 and 4.98 percentage points on the C3 and URO benchmarks, respectively. We open-source our code and models and provide an easy-to-use demo with full-stack support for streaming and full-duplex interaction.
## 1 Introduction
End-to-end (E2E) spoken language models (SLMs) have achieved strong progress in fluent and informative conversational abilities, with performance in understanding, reasoning, and instruction following approaching that of text-only models (Chen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib48); Zhang et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib29)). However, human speech carries rich paralinguistic cues—such as prosody, intonation, rhythm, and style—that convey personality and emotion. For instance, users may seek comforting speech or soft humming in specific situations. We formalize these aspects as role-playing and singing, viewing them as key forms of speech expressiveness (Huang et al., [2025](https://arxiv.org/html/2605.06765#bib.bib14)), which remain underexplored in E2E SLMs.
Existing expressive speech systems are largely task-specific and do not support general conversational assistants. Role-playing systems (Li et al., [2023](https://arxiv.org/html/2605.06765#bib.bib26); Wang et al., [2024c](https://arxiv.org/html/2605.06765#bib.bib25); Zhang et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib24)) typically adopt cascaded pipelines that combine LLM-based text generation with external speech synthesis. While modular, these approaches introduce significant engineering complexity due to their multi-component design. Traditional singing voice synthesis (SVS) methods rely on lyrics and musical scores (Pan et al., [2026](https://arxiv.org/html/2605.06765#bib.bib23)), limiting their use in real-world interactions where users only provide song or singer names. This motivates a more general setting where singing must be generated from minimal natural-language inputs.
A comparison of recent LLMs and SLMs is shown in Table [1](https://arxiv.org/html/2605.06765#S1.T1). Motivated by these limitations, we propose VITA-QinYu, the first E2E SLM supporting expressive speech generation alongside natural conversation. VITA-QinYu adopts a hybrid speech–text paradigm, extending interleaved modeling (Zeng et al., [2024b](https://arxiv.org/html/2605.06765#bib.bib50)) with parallel multi-codebook audio token modeling (Xie and Wu, [2024](https://arxiv.org/html/2605.06765#bib.bib43)), improving paralinguistic expressivity while reducing cross-modal interference (Nguyen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib49)). As a native end-to-end system, it avoids the complexity of cascaded pipelines.
To support expressive generation, we construct large-scale datasets for role-playing and singing. Our 2.6K-hour role-playing dataset covers 20K+ roles, derived from audiobooks with structured character extraction and LLM-generated interactive scripts, followed by instruction-based expressive speech synthesis. We also build a 1.2K-hour singing dataset by collecting trending songs, using MIDI-guided zero-shot SVS for high-quality vocals, and converting song information into natural-language instructions for conversational modeling.
Table 1: Comparison of existing LLMs and SLMs on speech modality (Speech), natural conversation (Natural Conv.), role-playing (Role-Play), end-to-end architecture (Arch.), and speech–text modeling paradigm (Paradigm). "N/A" denotes "not applicable".

We view role-playing and singing as early steps toward broader expressive speech generation. We hope this work provides a foundation for future research, and we will continue improving VITA-QinYu for these capabilities.
Our contributions are summarized as follows:
- We propose VITA-QinYu, the first E2E SLM with a hybrid text–speech paradigm supporting expressive role-playing and singing while maintaining strong conversational ability.
- We construct 3.8K hours of role-playing and singing datasets to address gaps in expressive speech modeling.
- Experiments show that VITA-QinYu achieves strong expressiveness, outperforming prior SLMs on role-playing and singing benchmarks, while matching or exceeding state-of-the-art conversational performance.
## 2 Related Works
**Spoken Language Models (SLMs)** E2E SLMs can be categorized by architecture and modeling paradigm. Architecturally, they include native and aligned models (Chen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib48)). Native SLMs (Défossez et al., [2024](https://arxiv.org/html/2605.06765#bib.bib46); Xie and Wu, [2024](https://arxiv.org/html/2605.06765#bib.bib43); Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47); Gao et al., [2025](https://arxiv.org/html/2605.06765#bib.bib99); Long et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib30); Zhang et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib29)) use a single decoder-only Transformer for joint text–audio modeling, but struggle with modality gaps and limited pre-training. Aligned SLMs (Fang et al., [2024](https://arxiv.org/html/2605.06765#bib.bib51); Chen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib48); Xu et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib102); [b](https://arxiv.org/html/2605.06765#bib.bib103)) adopt a "Thinker–Talker" two-stage design to preserve reasoning. Systems like MinMo (Chen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib48)) and Qwen-Omni (Xu et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib102); [b](https://arxiv.org/html/2605.06765#bib.bib103)) decouple reasoning and speech generation, but rely on separate synthesis modules, often limiting paralinguistic expressivity.
From a modeling perspective, parallel models (Défossez et al., [2024](https://arxiv.org/html/2605.06765#bib.bib46); Xie and Wu, [2024](https://arxiv.org/html/2605.06765#bib.bib43); Chen et al., [2024b](https://arxiv.org/html/2605.06765#bib.bib53); Gao et al., [2025](https://arxiv.org/html/2605.06765#bib.bib99); Ding et al., [2025](https://arxiv.org/html/2605.06765#bib.bib105); Zhang et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib29)) use multi-codebook audio tokens for richer acoustics but may weaken text–speech alignment (Nguyen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib49)), while interleaved models (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47); Long et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib30); Li et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib104)) alternate text and speech tokens for better linguistic consistency but rely on simpler audio representations and extra decoders for prosody. Extensions such as Baichuan-Audio (Li et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib104)) combine both ideas with more complex decoding pipelines. VITA-QinYu simplifies this design by replacing the flow-matching decoder with lightweight MLP heads, encouraging more unified text–audio modeling.
**Audio Tokenizer** The choice of audio tokenizer determines the trade-off among reconstruction fidelity, paralinguistic expressiveness, and inference efficiency. Residual-vector-quantization-based codecs (Défossez et al., [2024](https://arxiv.org/html/2605.06765#bib.bib46); Ye et al., [2025](https://arxiv.org/html/2605.06765#bib.bib131); Wang et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib7); Siuzdak et al., [2024](https://arxiv.org/html/2605.06765#bib.bib54); Gong et al., [2025](https://arxiv.org/html/2605.06765#bib.bib17)) represent audio through multiple codebooks, which naturally capture rich paralinguistic information such as speaker identity and prosody. Since the representation is highly descriptive, it places less computational demand on the decoder; a simple CNN-based decoder is often sufficient to reconstruct high-quality audio with low latency. In contrast, models like CosyVoice 2 (Du et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib41)) and GLM-4-Voice (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47)) rely on single-codebook semantic tokens. While these tokens are highly compressed for semantic efficiency, they often lose paralinguistic detail. In preliminary experiments, we find that these tokenizers fail to reconstruct the melody of the original singing voice.
**Role-Playing Models** Recent advances in LLMs enable strong role-playing capabilities (Chen et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib15)), supporting immersive character simulation. However, most speech role-playing systems remain cascaded. For example, ChatHaruhi (Li et al., [2023](https://arxiv.org/html/2605.06765#bib.bib26)) generates role-consistent text via an LLM and relies on external TTS for speech. OmniCharacter (Zhang et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib24)) encodes user queries with Whisper (Radford et al., [2023](https://arxiv.org/html/2605.06765#bib.bib59)), aligns them with a Qwen2.5-7B-Instruct (Yang et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib40)) backbone to produce text, then uses a separate speech LLM and synthesis module to generate role-aware speech.
**Singing Voice Synthesis Models** Traditional singing voice synthesis (SVS) generates high-fidelity vocals from lyrics and scores (Pan et al., [2025](https://arxiv.org/html/2605.06765#bib.bib10)), with recent advances improving quality and modeling. VISinger (Zhang et al., [2022b](https://arxiv.org/html/2605.06765#bib.bib12)), based on VITS (Kim et al., [2021](https://arxiv.org/html/2605.06765#bib.bib11)), enables end-to-end SVS; TokSing (Wu et al., [2024](https://arxiv.org/html/2605.06765#bib.bib9)) uses a non-autoregressive LM over quantized representations; and HiddenSinger (Hwang et al., [2025](https://arxiv.org/html/2605.06765#bib.bib8)) applies latent diffusion on neural codecs. However, most SVS systems depend on structured inputs (e.g., MIDI), limiting their use in interactive settings where users provide only natural language.
## 3 Methods
Figure 1: Architecture overview of VITA-QinYu. For text input, the LLM directly consumes embeddings; for speech input, a speaker module extracts speaker embeddings and an audio encoder extracts continuous features. An additional agent speaker embedding controls response timbre. Conditioned on these signals, the LLM generates interleaved text and multi-codebook audio tokens. Audio tokens are temporally shifted for quality, averaged back into the model for the next step, and decoded into waveforms. During training, the speaker and audio encoders are frozen, while only the adapters and LLM are updated.

Figure 2: Logic of (a) multi-turn conversation, (b) agent speaker generation, and (c) interruption.

The overview of VITA-QinYu's architecture is shown in Figure [1](https://arxiv.org/html/2605.06765#S3.F1), which consists of an audio encoder, an audio adapter, a speaker embedding module, a language model backbone, and eight language-modeling heads. Additionally, a text-to-timbre (TTT) module is integrated into the system for role-playing tasks. Detailed descriptions of each component are provided in the following sections.
**Backbone Model** The backbone of VITA-QinYu is a decoder-only Transformer-based language model (LM). We experiment with Qwen3-8B (Hu et al., [2026](https://arxiv.org/html/2605.06765#bib.bib113)) and Youtu-LLM-4B (Lu et al., [2025](https://arxiv.org/html/2605.06765#bib.bib19)), resulting in two variants: VITA-QinYu-8B and VITA-QinYu-4B. The backbone processes users' queries, whether in speech or text, and generates both text and audio responses in a hybrid paradigm, which we formulate as follows.
Denote the user's input as $X\in\mathcal{X}$, where $\mathcal{X}$ is the joint space of text and speech embeddings. Denote the model's text response and speech response as $Y\in\mathcal{V}$ and $Z\in\mathcal{U}$, respectively, where $\mathcal{V}$ is the text vocabulary and $\mathcal{U}$ is the speech codec vocabulary. In the multi-codebook setting with $J$ codebooks, we have $\mathcal{U}=\cup_{j=0}^{J-1}\mathcal{U}^{j}$, and the speech tokens $Z$ are multi-codebook tokens stacked in parallel, $Z=[Z^{j}]_{j=0}^{J-1}$, where $Z^{j}\in\mathcal{U}^{j}$ belongs to the $j$-th codebook vocabulary $\mathcal{U}^{j}$. We interleave the text and speech response tokens with a predefined ratio of $n:m$ into a new sequence $S$ as follows:

$$
S=[Y_{0:n-1},\,Z_{0:m-1},\,Y_{n:2n-1},\,Z_{m:2m-1},\,\dots],
\qquad (1)
$$

where text tokens and speech tokens alternate in blocks of size $n$ and $m$, respectively. Once the text tokens are exhausted, the remaining speech tokens are appended to the end of the sequence.
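As a concrete illustration, the sketch below builds the interleaved sequence of Eq. (1); the token lists and the ratio values are illustrative assumptions rather than the released training configuration.

```python
# A minimal sketch of the n:m text-speech interleaving in Eq. (1).
def interleave(text_tokens, speech_frames, n, m):
    """Alternate blocks of n text tokens and m speech frames; once the
    text is exhausted, the remaining speech frames are appended."""
    seq, ti, si = [], 0, 0
    while ti < len(text_tokens):
        seq.extend(text_tokens[ti:ti + n]); ti += n
        seq.extend(speech_frames[si:si + m]); si += m
    seq.extend(speech_frames[si:])  # leftover speech goes at the end
    return seq

# With n=2, m=3: [t0, t1, s0, s1, s2, t2, t3, s3, s4, s5, ...]
```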
Denote the dataset as $\mathcal{D}=\{(X_{i},S_{i})\}_{i=1}^{D}$, where $D$ is the number of samples. The negative log-likelihood $\mathcal{L}$ over the dataset $\mathcal{D}$ is

$$
\mathcal{L}=-\sum_{i=1}^{D}\sum_{t=1}^{T_{i}}\log P(S_{i,t}\mid X_{i},S_{i,<t}),
\qquad (2)
$$

where $T_{i}$ is the length of the interleaved sequence $S_{i}$. When $S_{t}\in\mathcal{V}$ is a text token, the conditional log-probability is computed as in a conventional LLM. When $S_{t}$ is a stack of speech tokens, the log-probability is modeled as the average log-probability across the $J$ codebooks. Formally,

$$
\log P(S_{t}\mid X,S_{<t})=
\begin{cases}
\log P(Y\mid X,S_{<t}), & \text{if } S_{t} \text{ is text: } S_{t}=Y,\\[4pt]
\dfrac{1}{J}\displaystyle\sum_{j=0}^{J-1}\log P(Z^{j}\mid X,S_{<t}), & \text{if } S_{t} \text{ is speech: } S_{t}=[Z^{j}]_{j=0}^{J-1},
\end{cases}
\qquad (3)
$$

where the sample index $i$ is omitted for clarity.
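A minimal sketch of the hybrid objective in Eqs. (2)–(3) is shown below, assuming PyTorch and hypothetical tensor layouts (the released training code may organize this differently): text steps use the standard next-token cross-entropy, and speech steps average the cross-entropy over the $J$ codebook heads.

```python
import torch
import torch.nn.functional as F

def hybrid_nll(text_logits, audio_logits, targets, is_speech, J=8):
    """text_logits:  (T, V_text)      logits from the shared LM head
    audio_logits: (T, J, V_audio)  logits from the J audio heads
    targets:      (T, J)           ids; column 0 holds the text id on text steps
    is_speech:    (T,)             bool mask marking speech positions"""
    total = 0.0
    for t in range(targets.size(0)):
        if is_speech[t]:
            # speech branch of Eq. (3): average NLL over the J codebooks
            per_cb = [F.cross_entropy(audio_logits[t, j].unsqueeze(0),
                                      targets[t, j].unsqueeze(0))
                      for j in range(J)]
            total = total + torch.stack(per_cb).mean()
        else:
            # text branch of Eq. (3): conventional next-token NLL
            total = total + F.cross_entropy(text_logits[t].unsqueeze(0),
                                            targets[t, 0].unsqueeze(0))
    return total / targets.size(0)
```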
**Multi-Turn Conversation** We prepend the conversation history to the LLM's input to support multi-turn interactions. The user's query, whether in text or speech, is included as-is. Since speech responses are often lengthy and largely redundant with the corresponding text responses, we discard the speech and retain only the text response in the history context, as illustrated in Figure [2(a)](https://arxiv.org/html/2605.06765#S3.F2.sf1).
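A minimal sketch of this history construction, assuming a chat-style message list; the field names are illustrative, not the released API.

```python
def build_history(turns):
    """Keep user queries as-is, but keep only the text half of each
    past assistant turn; the speech response is dropped."""
    history = []
    for turn in turns:
        history.append({"role": "user", "content": turn["query"]})
        history.append({"role": "assistant", "content": turn["text_response"]})
        # turn["speech_response"] is intentionally not added to the history
    return history
```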
**Audio Encoder and Adapter** We use SenseVoiceSmall (An et al., [2024](https://arxiv.org/html/2605.06765#bib.bib18)) to encode input speech into 16.7 Hz continuous features with a hidden size of 560, and an MLP-based adapter to align SenseVoiceSmall's output with the backbone language model. We keep the audio encoder frozen and train only the adapter.
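A minimal sketch of such an adapter, assuming PyTorch; the depth, activation, and LLM hidden size (4096 here) are assumptions rather than the released configuration.

```python
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Projects frozen SenseVoiceSmall features (dim 560) into the LLM
    embedding space; only this module is trained."""
    def __init__(self, in_dim=560, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):  # feats: (batch, frames, 560) at 16.7 Hz
        return self.proj(feats)
```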
**Speech Decoder** Following parallel models (Xie and Wu, [2024](https://arxiv.org/html/2605.06765#bib.bib43); Gao et al., [2025](https://arxiv.org/html/2605.06765#bib.bib99)), we adopt the multi-codebook XY-Tokenizer (Gong et al., [2025](https://arxiv.org/html/2605.06765#bib.bib17)), which encodes speech into eight 12.5 Hz codebooks (100 Hz total). It outperforms single-codebook tokenizers (e.g., CosyVoice (Du et al., [2024b](https://arxiv.org/html/2605.06765#bib.bib6)), GLM-4-Voice (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47))) in reconstructing both speech and singing. We average the tokens of each frame as LLM input, extend the LM head to predict text or the first audio layer, and add seven heads for the remaining layers. We find that a one-token delay per layer (Figure [1](https://arxiv.org/html/2605.06765#S3.F1)) improves speech quality; without it, speech is intelligible but overly fast.
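The per-layer delay can be sketched as follows; the padding scheme is an assumption, but the shift itself matches the one-token-per-layer delay described above.

```python
def apply_delay(codes, pad_id=0):
    """codes[j][t] is the token of codebook j at frame t (J lists of length T).
    Shift codebook j right by j frames, so layer j is predicted j steps after
    layer 0; pad_id fills the triangular gaps at the ends."""
    J, T = len(codes), len(codes[0])
    delayed = [[pad_id] * (T + J - 1) for _ in range(J)]
    for j in range(J):
        for t in range(T):
            delayed[j][t + j] = codes[j][t]
    return delayed
```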
**Speaker Encoder and Timbre Generation** LLM backbones can learn speaker embeddings for better timbre generalization (Zhou et al., [2025](https://arxiv.org/html/2605.06765#bib.bib110); Du et al., [2025](https://arxiv.org/html/2605.06765#bib.bib109); Hu et al., [2026](https://arxiv.org/html/2605.06765#bib.bib113)). To enable diverse timbre control, we inject CAM++ (Wang et al., [2023](https://arxiv.org/html/2605.06765#bib.bib115)) speaker embeddings into the LLM during training (Figure [2(b)](https://arxiv.org/html/2605.06765#S3.F2.sf2)). We compute averaged agent embeddings from 100 samples per voice and extract user embeddings from queries. At inference, we adopt the Text-to-Timbre module from DeepDubbing (Dai et al., [2025](https://arxiv.org/html/2605.06765#bib.bib107)), which uses CFM (Lipman et al., [2023](https://arxiv.org/html/2605.06765#bib.bib114)) to generate agent embeddings from character descriptions.
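A minimal sketch of the averaged agent embedding, with `extract_embedding` standing in for a CAM++-style extractor (a hypothetical helper); the unit-normalization is a common convention, not a stated detail.

```python
import numpy as np

def agent_embedding(wav_paths, extract_embedding):
    """Average speaker embeddings over the ~100 reference samples of a voice."""
    embs = np.stack([extract_embedding(p) for p in wav_paths])  # (N, D)
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)
```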
**Full-Duplex Interaction** We use SileroVAD (Silero-Team, [2024](https://arxiv.org/html/2605.06765#bib.bib5)) for voice activity detection and Whisper (Radford et al., [2023](https://arxiv.org/html/2605.06765#bib.bib59)) for ASR. Non-empty transcripts are processed by the turn detector TEN (TEN-Team, [2025](https://arxiv.org/html/2605.06765#bib.bib13)); when it signals "Finished," audio is sent to VITA-QinYu for response generation; otherwise the system continues listening (Figure [2(c)](https://arxiv.org/html/2605.06765#S3.F2.sf3)).
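A high-level sketch of this control flow is given below; the wrappers around SileroVAD, Whisper, and the TEN turn detector are hypothetical and do not reflect the released call signatures.

```python
def duplex_loop(vad, asr, turn_detector, model, mic_stream):
    """VAD gates the microphone; ASR transcribes buffered speech; the turn
    detector decides whether the user has finished before responding."""
    buffer = []
    for chunk in mic_stream:
        if vad.is_speech(chunk):          # user is (still) talking
            buffer.append(chunk)
            continue
        if not buffer:                    # silence with nothing buffered
            continue
        transcript = asr.transcribe(b"".join(buffer))
        if transcript and turn_detector.predict(transcript) == "Finished":
            model.respond(b"".join(buffer))  # hand the audio to VITA-QinYu
            buffer.clear()
        # otherwise keep listening and accumulate more audio
```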
## 4 Data Collection
### 4.1 Pretraining Data
Following Long et al. ([2025b](https://arxiv.org/html/2605.06765#bib.bib100)), we collect open-source ASR, TTS, SQA, and text data for pretraining and further augment it with in-house datasets. Details are provided in the Appendix.
### 4.2 Conversational Data
To endow our model with both fundamental conversational skills and more natural, colloquial expression patterns, we construct a General Conversational Dataset and a High-Quality Colloquial Speech Dataset. The conversational data pipeline is illustrated in Figure [3(a)](https://arxiv.org/html/2605.06765#S4.F3.sf1).
Figure 3: Data pipelines for (a) natural conversation, (b) role-playing, and (c) singing.

**General Conversational Dataset** We source raw text data from AudioQA-1.0M (Gao et al., [2025](https://arxiv.org/html/2605.06765#bib.bib99)), Voice Assistant 400K (Xie and Wu, [2024](https://arxiv.org/html/2605.06765#bib.bib43)), and Infinity-Instruct (Li et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib111)). To ensure data quality, we restrict the maximum token length to 800 to exclude excessively long responses. We also filter out content that deviates from conversational scenarios, such as complex mathematical formulas and code snippets. For each response turn, we use DeepSeek to evaluate the content on topic relevance, tone, and fluency; scores are assigned on a scale of 0 to 1, and low-scoring dialogues are discarded. Ultimately, we curate 800K Chinese and English samples from AudioQA-1.0M, 300K Chinese samples from Infinity-Instruct, and 450K English samples from Voice Assistant 400K.
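A minimal sketch of this filtering step; the judge interface and the 0.6 threshold are illustrative assumptions (the paper states only a 0–1 scale and the 800-token cap).

```python
import re

def looks_like_code_or_math(text):
    # crude stand-in heuristic for non-conversational content:
    # LaTeX commands or dense runs of code-like symbols
    return bool(re.search(r"\\frac|\\sum|[{};]{3,}", text))

def filter_dialogues(dialogues, tokenizer, judge, max_tokens=800, min_score=0.6):
    kept = []
    for d in dialogues:
        if len(tokenizer.encode(d["response"])) > max_tokens:
            continue  # drop excessively long responses
        if looks_like_code_or_math(d["response"]):
            continue  # drop formulas / code snippets
        # LLM judge scores topic relevance, tone, and fluency in [0, 1]
        if judge.score(d) >= min_score:
            kept.append(d)
    return kept
```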
**High-Quality Colloquial Dataset** Although current E2E SLMs support basic interaction, their responses are often unnatural due to limited prosodic variation. To address this, we select high-scoring samples from our general conversational dataset and refine the responses. Inspired by OpenS2S (Wang et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib112)), we place a strong emphasis on the emotional nuances inherent in human conversation and the flexibility of emotional expression. We leverage DeepSeek (Liu et al., [2024](https://arxiv.org/html/2605.06765#bib.bib1)) to analyze the potential underlying emotions in the dialogue text and generate corresponding colloquial expressions based on these emotions, further enhancing the human-like quality of the interactions. In total, we collect and rewrite 400K textual dialogue samples featuring diverse emotional expressions.
User speech input is naturally diverse, influenced by factors such as age, gender, and accent. To improve model robustness, we simulate realistic user queries by extracting over 90K unique speaker prompts from ASR datasets (e.g., Aishell and Common Voice). These queries are synthesized using an in-house zero-shot TTS model. On the response side, we use a speaker-fine-tuned TTS model to ensure high-quality audio output.
### 4.3 Role-playing Data
The role-playing pipeline synthesizes high-fidelity multi-turn dialogues, enabling models to capture vocal context and maintain stable timbre. As shown in Figure [3(b)](https://arxiv.org/html/2605.06765#S4.F3.sf2), it has three steps. (1) Character profile collection and formalization: although audiobooks often deviate from interactive dialogue structures, they contain diverse characters with distinct personalities, making them an ideal source of character profiles. Inspired by Deep Dubbing (Dai et al., [2025](https://arxiv.org/html/2605.06765#bib.bib107)), we use LLMs to annotate large-scale audiobook corpora, producing a diverse library of over 20K unique character profiles, each associated with a distinct vocal timbre and defined by four key attributes: role demographics, social identity, behavioral temperament, and acoustic vocal style. (2) Character multi-turn script construction: we design an attribute-constrained prompting strategy to construct high-quality role-playing scripts, as sketched below. Our generation prioritizes the alignment of persona, scenario, and linguistic style. For example, a character defined as a "gentle snack shop owner" may be instantiated in a "casual purchase conversation," and specialized prompts based on the personality description ensure that generated responses remain consistent with the specified gentle temperament. To support multi-turn interaction, each session includes 8–15 dialogue turns. This process yields a corpus of 80K role-playing scripts spanning diverse scenarios. (3) Instruction-based expressive speech synthesis: a key challenge in speech synthesis is the expressivity gap, i.e., mapping implicit text sentiment to explicit prosody. To address this, we use LLMs to add instruction tags to the role-playing scripts and apply in-house instruct-TTS models inspired by Yang et al. ([2024b](https://arxiv.org/html/2605.06765#bib.bib108)), Du et al. ([2025](https://arxiv.org/html/2605.06765#bib.bib109)), and Zhou et al. ([2025](https://arxiv.org/html/2605.06765#bib.bib110)) to generate role-playing speech with fine-grained prosodic and emotional control.
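A minimal sketch of the attribute-constrained prompt for step (2); the profile fields follow the four attributes named above, but the prompt wording itself is an illustrative assumption.

```python
def build_script_prompt(profile, scenario, min_turns=8, max_turns=15):
    """Compose a script-generation prompt constrained by a character profile."""
    return (
        "Write a role-playing dialogue.\n"
        f"Role demographics: {profile['demographics']}\n"
        f"Social identity: {profile['identity']}\n"
        f"Behavioral temperament: {profile['temperament']}\n"
        f"Acoustic vocal style: {profile['vocal_style']}\n"
        f"Scenario: {scenario}\n"
        f"Produce {min_turns}-{max_turns} turns; every line must stay "
        "consistent with the temperament above."
    )

# e.g. a "gentle snack shop owner" profile paired with a
# "casual purchase conversation" scenario, as in the example above.
```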
### 4.4 Singing Data
Most SVS systems (Pan et al., [2026](https://arxiv.org/html/2605.06765#bib.bib23)) are score- or melody-controlled, limiting seamless human–agent singing interaction. To enable unconstrained interaction, we propose a scalable three-step pipeline for constructing the instruction-based singing dataset, as shown in Figure [3(c)](https://arxiv.org/html/2605.06765#S4.F3.sf3). (1) Song collection: to ensure our model captures a comprehensive distribution of contemporary vocal interactions, we curate a diverse corpus of 5K popular musical compositions. We perform structural decomposition on these tracks to identify fine-grained segments such as verses and choruses; this temporal partitioning enables precise, localized control over the generative process. We then extract symbolic MIDI and word durations as references for the SVS system to maintain pitch and rhythmic consistency. (2) Snippet generation: we employ an in-house zero-shot SVS system to generate vocal snippets for each song, conditioned on the MIDI and structural information extracted in the previous stage. The synthesis is additionally conditioned on target speaker embeddings to ensure high-fidelity vocals that are both natural and timbre-consistent. (3) Instruction design: we systematically enumerate instruction formats, ranging from artist-specific queries to requests for particular song segments, and leverage LLMs to paraphrase these templates into natural, colloquial questions. Finally, following the established pipeline for conversational data, all textual instructions are converted into query speech.
## 5 Experiments
We evaluate VITA-QinYu on role-playing, singing, and natural conversation benchmarks. Training details, VITA-QinYu's ASR and TTS results, and ablation studies on speech tokenization and speaker injection are provided in the Appendix.
### 5.1 Role-playing
Table 2: Objective and subjective results for role-playing in text and speech. Text is evaluated on Character Consistency (CST), Conversation Ability (CNV), and Attractiveness (ATR); speech is assessed by Speaker Similarity (SS) against ground truth. Overall performance is averaged (Avg.). Subjective speech evaluation includes Character Matching (CMS), Naturalness (MOS-N), and Emotion (MOS-E), all on a 5-point scale.

| Model | CST | CNV | ATR | SS | Avg. | CMS | MOS-N | MOS-E |
|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B | 48.26 | 73.94 | 53.45 | - | - | - | - | - |
| Qwen3-8B | 50.37 | 74.82 | 55.28 | - | - | - | - | - |
| Qwen-7B-Chat | 47.33 | 72.44 | 52.44 | - | - | - | - | - |
| XVERSE-7B-Chat | 48.47 | 74.07 | 53.60 | - | - | - | - | - |
| GLM-9B-Chat | 50.34 | 75.42 | 55.20 | - | - | - | - | - |
| Qwen2.5-7B-Instruct | 49.05 | 74.80 | 54.27 | - | - | - | - | - |
| Qwen2.5-Omni | 50.24 | 73.52 | 55.12 | 38.90 | 54.45 | 2.21 | 3.32 | 3.16 |
| Kimi-Audio | 50.48 | 75.30 | 55.22 | 35.18 | 54.04 | 2.17 | 3.64 | 3.38 |
| VITA-QinYu-4B | 48.60 | 73.90 | 53.82 | 64.07 | 60.13 | 3.54 | 3.84 | 4.08 |
| VITA-QinYu-8B | 50.34 | 75.48 | 55.31 | 64.70 | 61.45 | 3.45 | 3.74 | 3.99 |
We evaluate textual role-playing performance on the CharacterEval benchmark (Tu et al., [2024](https://arxiv.org/html/2605.06765#bib.bib132)). The assessment covers three dimensions: Character Consistency, Conversation Ability, and Role-playing Attractiveness. We construct character profiles from audiobook data, which are typically more concise and lack extensive background information compared to traditional role-playing corpora. An LLM is used to select conversations corresponding to specific evaluation dimensions, and role-playing-oriented models from the CharacterEval benchmark are employed to score performance along these dimensions. Additionally, we evaluate speech performance using speaker similarity scores between generated speech and ground-truth responses.
The baselines comprise general-purpose LLMs such as Baichuan2-7B (Yang et al., [2023](https://arxiv.org/html/2605.06765#bib.bib133)) and Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2605.06765#bib.bib134)); chat agents such as Qwen-7B-Chat (Qwen, [2023](https://arxiv.org/html/2605.06765#bib.bib136)), GLM-9B-Chat (Glm et al., [2024](https://arxiv.org/html/2605.06765#bib.bib135)), and XVERSE-7B-Chat (XVERSE-7B, [2023](https://arxiv.org/html/2605.06765#bib.bib137)); and a role-playing-specialized model, Qwen2.5-7B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib138)). All these baselines are text-only models and lack speech generation capability. We further include SLMs, such as Qwen2.5-Omni (Qwen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib138)) and Kimi-Audio (KimiTeam et al., [2025](https://arxiv.org/html/2605.06765#bib.bib27)), for comparison. Notably, none of the peer models support dynamic timbre control, which may limit the fairness of the comparison. The results are shown in Table [2](https://arxiv.org/html/2605.06765#S5.T2). The original Consistency, Conversation, and Attractiveness scores are measured on a 5-point scale and linearly rescaled to a 100-point scale to align with the speaker similarity scores. Our method achieves competitive text-level role-playing performance relative to conventional role-playing models. In particular, VITA-QinYu-8B demonstrates strong performance in both conversational ability and attractiveness. Both VITA-QinYu-4B and VITA-QinYu-8B achieve speaker similarity scores of approximately 64%, indicating reasonable alignment between the synthesized timbre and the target role description. The higher speaker similarity compared to Qwen2.5-Omni and Kimi-Audio is primarily due to their lack of support for role-conditioned timbre control. Detailed textual evaluations are shown in the Appendix.
Additionally, at the speech level, we conduct subjective mean opinion score (MOS) evaluations to assess the quality of role-playing interactions. Five annotators rate each sample on a 5-point scale. The evaluation measures performance along three dimensions: the character matching score (Dai et al., [2025](https://arxiv.org/html/2605.06765#bib.bib107)) (CMS), which assesses the consistency between the synthesized speech timbre and the user-defined profile; perceptual naturalness (MOS-N), which assesses acoustic fidelity; and response emotion (MOS-E), which assesses the appropriateness of the expressed emotion. As shown in Table [2](https://arxiv.org/html/2605.06765#S5.T2), the results demonstrate strong MOS performance for role-playing. In particular, VITA-QinYu-4B and VITA-QinYu-8B achieve a CMS of approximately 3.5, indicating good alignment between the generated voice and user expectations. Both also attain high MOS-N and MOS-E scores, suggesting that the synthesized speech is natural and emotionally appropriate. These results indicate that the proposed method effectively aligns both content and timbre with the target character profile. The scores of Qwen2.5-Omni and Kimi-Audio are included for reference as fixed-timbre SLMs.
### 5.2 Singing
For objective evaluation, we employ the latest FunASR model (An et al., [2025](https://arxiv.org/html/2605.06765#bib.bib141)) to transcribe the generated singing and compute the word error rate (WER) to assess pronunciation accuracy. We further use SingMOS (Tang et al., [2025](https://arxiv.org/html/2605.06765#bib.bib139)) and Sheet-SSQA (Huang et al., [2024](https://arxiv.org/html/2605.06765#bib.bib140)) to evaluate the perceptual quality of the generated singing. For subjective evaluation, three annotators assess 100 randomly sampled singing generation requests, from which MOS scores are computed. We adopt standard SVS metrics, including sound quality (Qua.), pronunciation clarity (Clar.), naturalness (Nat.), and expressiveness (Expr.). VITA-QinYu is compared with an SVS model conditioned on the corresponding MIDI files in both objective and subjective evaluations, and with Doubao Soul Singer in subjective evaluations.
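For reference, a corpus-level WER over the transcripts can be computed as below; `jiwer` is a common WER library used here as a stand-in, since the paper does not name its scoring implementation.

```python
import jiwer

def singing_wer(reference_lyrics, asr_transcripts):
    """Corpus-level word error rate between reference lyrics and the
    ASR transcripts of the generated singing (lists of strings)."""
    return jiwer.wer(reference_lyrics, asr_transcripts)
```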
Table 3: Objective and subjective evaluation results on singing. Objective metrics include SingMOS, Sheet-SSQA, and WER, where SingMOS and Sheet-SSQA measure singing voice perceptual quality and WER measures pronunciation accuracy. Subjective metrics include sound quality (Qua.), pronunciation clarity (Clar.), naturalness (Nat.), and expressiveness (Expr.). All scores are reported on a 5-point scale.

As shown in Table [3](https://arxiv.org/html/2605.06765#S5.T3), VITA-QinYu achieves singing quality, measured by SingMOS and Sheet-SSQA scores, comparable to specialized in-house SVS models. It attains a WER of approximately 0.2, indicating reasonable singing accuracy. However, pronunciation clarity, audio quality, and expressiveness remain inferior to those of the SVS model, likely due to the SVS model's reliance on MIDI guidance and the absence of song-specific training for our audio tokenizer. Overall, these results demonstrate the feasibility of integrating singing capabilities into E2E SLMs. Compared with Doubao Soul Singer, VITA-QinYu achieves higher MOS across all four subjective dimensions, indicating superior singing voice quality.
### 5.3 Natural Conversation Benchmark
We evaluate the natural conversation ability of VITA-QinYu on two spoken dialogue benchmarks: the C3 Benchmark (Ma et al., [2025](https://arxiv.org/html/2605.06765#bib.bib20)) and URO-Bench (Yan et al., [2025](https://arxiv.org/html/2605.06765#bib.bib22)).
Table 4: Results on the C3 Benchmark. The best results are highlighted in **bold**, and the second-best are underlined.

The C3 Benchmark is a bilingual benchmark that evaluates the complex-conversation capabilities of SLMs. It decomposes conversational complexity into two dimensions: ambiguity and context dependency. Ambiguity is evaluated through two subtests, phonological ambiguity and semantic ambiguity, while context dependency consists of three subtests: omission, coreference, and multi-turn interaction. The ambiguity and context-dependency scores are computed by averaging their respective subtests, and the overall score is obtained by averaging across all five subtests. We compare VITA-QinYu against Step-Audio (Huang et al., [2025](https://arxiv.org/html/2605.06765#bib.bib14)), Qwen2.5-Omni (Xu et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib102)), Kimi-Audio (Ding et al., [2025](https://arxiv.org/html/2605.06765#bib.bib105)), and GLM-4-Voice (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47)). The results are shown in Table [4](https://arxiv.org/html/2605.06765#S5.T4). VITA-QinYu-8B achieves the second-highest overall score on English, following Qwen2.5-Omni, and the highest on Chinese. VITA-QinYu-8B performs better on resolving context-dependency-related tasks but is weaker at handling ambiguity-related questions.
Table 5: Results on URO-Bench. "U.", "C.", "R.", and "Avg." denote "Understanding", "Conversation", "Reasoning", and "Average", respectively.

We evaluate VITA-QinYu on URO-Bench, which comprises three subtests: Understanding, Oral Conversation, and Reasoning. In addition to the subtest results, we report the average score across all three. We compare VITA-QinYu with Freeze-Omni, Qwen2.5-Omni, Kimi-Audio, GLM-4-Voice, and Chroma (Chen et al., [2026](https://arxiv.org/html/2605.06765#bib.bib3)). Notably, Chroma is a recent end-to-end SLM designed for voice cloning and natural conversation, with a model scale comparable to VITA-QinYu-4B. The results are presented in Table [5](https://arxiv.org/html/2605.06765#S5.T5). VITA-QinYu-8B ranks first in both English and Chinese among the compared models. VITA-QinYu-4B lags approximately 7 percentage points behind VITA-QinYu-8B: it ranks third in English, with an average score 1.15 percentage points lower than Qwen2.5-Omni and 1.07 percentage points lower than GLM-4-Voice, and second in Chinese, with a 2.85-percentage-point gap to GLM-4-Voice. Despite a similar model scale, VITA-QinYu-4B outperforms Chroma by 11.44 percentage points in English.
## 6 Conclusion
In this work, we present VITA-QinYu, the first end-to-end spoken language model capable of not only natural conversation but also expressive speech generation, including role-playing and singing. VITA-QinYu adopts a novel hybrid text–speech modeling approach that enables native end-to-end learning of rich paralinguistic features. Both subjective and objective evaluations demonstrate that VITA-QinYu achieves state-of-the-art performance on spoken dialogue benchmarks while exhibiting strong singing and role-playing abilities. The role-playing and singing capabilities introduced in this work represent early exploratory efforts toward broader expressive speech generation. We hope that this study can provide a useful starting point for future research.
#### Author Contributions
We would like to express our sincere gratitude to all contributors, including those not listed in the paper, for their invaluable support and efforts\. The contributors are listed in no particular order\.
Contributors: Jiacheng Xu¹, Heting Gao², Liufei Xie¹, Zhenchuan Yang¹, Lijiang Li³, Yiting Chen¹, Bin Zhang¹, Meng Chen¹, Chaoyu Fu³, Weifeng Zhao¹, Wenjiang Zhou¹

Affiliations: ¹TME Lyra Lab, ²Tencent YouTu Lab, ³Nanjing University
## References
- K\. An, Q\. Chen, C\. Deng, Z\. Du, C\. Gao, Z\. Gao, Y\. Gu, T\. He, H\. Hu, K\. Hu,et al\.\(2024\)Funaudiollm: voice understanding and generation foundation models for natural interaction between humans and llms\.arXiv preprint arXiv:2407\.04051\.Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p5.2)\.
- K\. An, Y\. Chen, Z\. Chen, C\. Deng, Z\. Du, C\. Gao, Z\. Gao, B\. Gong, X\. Li, Y\. Li,et al\.\(2025\)Fun\-asr technical report\.arXiv preprint arXiv:2509\.12508\.Cited by:[§5\.2](https://arxiv.org/html/2605.06765#S5.SS2.p1.1)\.
- P\. Anastassiou, J\. Chen, J\. Chen, Y\. Chen, Z\. Chen, Z\. Chen, J\. Cong, L\. Deng, C\. Ding, L\. Gao, M\. Gong, P\. Huang, Q\. Huang, Z\. Huang, Y\. Huo, D\. Jia, C\. Li, F\. Li, H\. Li, J\. Li, X\. Li, X\. Li, L\. Liu, S\. Liu, S\. Liu, X\. Liu, Y\. Liu, Z\. Liu, L\. Lu, J\. Pan, X\. Wang, Y\. Wang, Y\. Wang, Z\. Wei, J\. Wu, C\. Yao, Y\. Yang, Y\. Yi, J\. Zhang, Q\. Zhang, S\. Zhang, W\. Zhang, Y\. Zhang, Z\. Zhao, D\. Zhong, and X\. Zhuang \(2024\)Seed\-tts: a family of high\-quality versatile speech generation models\.ArXivabs/2406\.02430\.External Links:[Link](https://api.semanticscholar.org/CorpusID:270226353)Cited by:[Appendix D](https://arxiv.org/html/2605.06765#A4.p3.1)\.
- R\. Ardila, M\. Branson, K\. Davis, M\. Kohler, J\. Meyer, M\. Henretty, R\. Morais, L\. Saunders, F\. Tyers, and G\. Weber \(2020\)Common voice: a massively\-multilingual speech corpus\.InProceedings of the twelfth language resources and evaluation conference,pp\. 4218–4222\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- K\. Baba, W\. Nakata, Y\. Saito, and H\. Saruwatari \(2024\)The t05 system for the VoiceMOS Challenge 2024: transfer learning from deep image classifier to naturalness MOS prediction of high\-quality synthetic speech\.InIEEE Spoken Language Technology Workshop \(SLT\),pp\. 818–824\.External Links:[Document](https://dx.doi.org/10.1109/SLT61566.2024.10832315)Cited by:[Appendix F](https://arxiv.org/html/2605.06765#A6.p2.1)\.
- H\. Bu, J\. Du, X\. Na, B\. Wu, and H\. Zheng \(2017\)Aishell\-1: an open\-source mandarin speech corpus and a speech recognition baseline\.In2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment \(O\-COCOSDA\),pp\. 1–5\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1),[Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1)\.
- G\. Chen, S\. Chai, G\. Wang, J\. Du, W\. Zhang, C\. Weng, D\. Su, D\. Povey, J\. Trmal, J\. Zhang,et al\.\(2021\)Gigaspeech: an evolving, multi\-domain asr corpus with 10,000 hours of transcribed audio\.arXiv preprint arXiv:2106\.06909\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- J\. Chen, X\. Wang, R\. Xu, S\. Yuan, Y\. Zhang, W\. Shi, J\. Xie, S\. Li, R\. Yang, T\. Zhu,et al\.\(2024a\)From persona to personalization: a survey on role\-playing language agents\.arXiv preprint arXiv:2404\.18231\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p4.1)\.
- Q\. Chen, Y\. Chen, Y\. Chen, M\. Chen, Y\. Chen, C\. Deng, Z\. Du, R\. Gao, C\. Gao, Z\. Gao,et al\.\(2025\)MinMo: a multimodal large language model for seamless voice interaction\.arXiv preprint arXiv:2501\.06282\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p3.1),[§1](https://arxiv.org/html/2605.06765#S1.p1.1),[§2](https://arxiv.org/html/2605.06765#S2.p1.1)\.
- T\. Chen, T\. Chen, K\. Shen, Z\. Bao, Z\. Zhang, M\. Yuan, and Y\. Shi \(2026\)FlashLabs chroma 1\.0: a real\-time end\-to\-end spoken dialogue model with personalized voice cloning\.External Links:[Link](https://api.semanticscholar.org/CorpusID:284860926)Cited by:[§5\.3](https://arxiv.org/html/2605.06765#S5.SS3.p3.5)\.
- W\. Chen, Z\. Ma, R\. Yan, Y\. Liang, X\. Li, R\. Xu, Z\. Niu, Y\. Zhu, Y\. Yang, Z\. Liu,et al\.\(2024b\)SLAM\-omni: timbre\-controllable voice interaction system with single\-stage training\.arXiv preprint arXiv:2412\.15649\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p2.1)\.
- Z\. Dai, Y\. Chen, J\. Xu, L\. Xie, Y\. Wang, Z\. Yang, B\. Bai, Y\. Gao, W\. Zhou, W\. Zhao,et al\.\(2025\)Deep dubbing: end\-to\-end auto\-audiobook system with text\-to\-timbre and context\-aware instruct\-tts\.arXiv preprint arXiv:2509\.15845\.Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p7.1),[§4\.3](https://arxiv.org/html/2605.06765#S4.SS3.p1.1),[§5\.1](https://arxiv.org/html/2605.06765#S5.SS1.p3.2)\.
- A\. Défossez, L\. Mazaré, M\. Orsini, A\. Royer, P\. Pérez, H\. Jégou, E\. Grave, and N\. Zeghidour \(2024\)Moshi: a speech\-text foundation model for real\-time dialogue\.arXiv preprint arXiv:2410\.00037\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p1.1),[§2](https://arxiv.org/html/2605.06765#S2.p2.1),[§2](https://arxiv.org/html/2605.06765#S2.p3.1)\.
- D\. Ding, Z\. Ju, Y\. Leng, S\. Liu, T\. Liu, Z\. Shang, K\. Shen, W\. Song, X\. Tan, H\. Tang,et al\.\(2025\)Kimi\-audio technical report\.arXiv preprint arXiv:2504\.18425\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p2.1),[§5\.3](https://arxiv.org/html/2605.06765#S5.SS3.p2.1)\.
- J\. Du, X\. Na, X\. Liu, and H\. Bu \(2018\)Aishell\-2: transforming mandarin asr research into industrial scale\.arXiv preprint arXiv:1808\.10583\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- Z\. Du, Q\. Chen, S\. Zhang, K\. Hu, H\. Lu, Y\. Yang, H\. Hu, S\. Zheng, Y\. Gu, Z\. Ma,et al\.\(2024a\)Cosyvoice: a scalable multilingual zero\-shot text\-to\-speech synthesizer based on supervised semantic tokens\.arXiv preprint arXiv:2407\.05407\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p3.1)\.
- Z\. Du, C\. Gao, Y\. Wang, F\. Yu, T\. Zhao, H\. Wang, X\. Lv, H\. Wang, C\. Ni, X\. Shi,et al\.\(2025\)Cosyvoice 3: towards in\-the\-wild speech generation via scaling\-up and post\-training\.arXiv preprint arXiv:2505\.17589\.Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p7.1),[§4\.3](https://arxiv.org/html/2605.06765#S4.SS3.p1.1)\.
- Z\. Du, Y\. Wang, Q\. Chen, X\. Shi, X\. Lv, T\. Zhao, Z\. Gao, Y\. Yang, C\. Gao, H\. Wang,et al\.\(2024b\)Cosyvoice 2: scalable streaming speech synthesis with large language models\.arXiv preprint arXiv:2412\.10117\.Cited by:[Appendix D](https://arxiv.org/html/2605.06765#A4.p3.1),[Appendix F](https://arxiv.org/html/2605.06765#A6.p1.1),[§3](https://arxiv.org/html/2605.06765#S3.p6.1)\.
- Q\. Fang, S\. Guo, Y\. Zhou, Z\. Ma, S\. Zhang, and Y\. Feng \(2024\)Llama\-omni: seamless speech interaction with large language models\.arXiv preprint arXiv:2409\.06666\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p1.1)\.
- Y\. Fu, L\. Cheng, S\. Lv, Y\. Jv, Y\. Kong, Z\. Chen, Y\. Hu, L\. Xie, J\. Wu, H\. Bu,et al\.\(2021\)Aishell\-4: an open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario\.arXiv preprint arXiv:2104\.03603\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- D\. Galvez, G\. Diamos, J\. Ciro, J\. F\. Cerón, K\. Achorn, A\. Gopi, D\. Kanter, M\. Lam, M\. Mazumder, and V\. J\. Reddi \(2021\)The people’s speech: a large\-scale diverse english speech recognition dataset for commercial usage\.arXiv preprint arXiv:2111\.09344\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- H\. Gao, H\. Shao, X\. Wang, C\. Qiu, Y\. Shen, S\. Cai, Y\. Shi, Z\. Xu, Z\. Long, Y\. Zhang,et al\.\(2025\)Lucy: linguistic understanding and control yielding early stage of her\.arXiv preprint arXiv:2501\.16327\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p3.1),[§2](https://arxiv.org/html/2605.06765#S2.p1.1),[§2](https://arxiv.org/html/2605.06765#S2.p2.1),[§3](https://arxiv.org/html/2605.06765#S3.p6.1),[§4\.2](https://arxiv.org/html/2605.06765#S4.SS2.p2.1)\.
- Z\. Gao, S\. Zhang, I\. McLoughlin, and Z\. Yan \(2022\)Paraformer: fast and accurate parallel transformer for non\-autoregressive end\-to\-end speech recognition\.arXiv preprint arXiv:2206\.08317\.Cited by:[Appendix D](https://arxiv.org/html/2605.06765#A4.p3.1)\.
- T\. Glm, A\. Zeng, B\. Xu, B\. Wang, C\. Zhang, D\. Yin, D\. Zhang, D\. Rojas, G\. Feng, H\. Zhao,et al\.\(2024\)Chatglm: a family of large language models from glm\-130b to glm\-4 all tools\.arXiv preprint arXiv:2406\.12793\.Cited by:[§5\.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1)\.
- Y\. Gong, L\. Jin, R\. Deng, D\. Zhang, X\. Zhang, Q\. Cheng, Z\. Fei, S\. Li, and X\. Qiu \(2025\)XY\-tokenizer: mitigating the semantic\-acoustic conflict in low\-bitrate speech codecs\.arXiv preprint arXiv:2506\.23325\.Cited by:[Appendix F](https://arxiv.org/html/2605.06765#A6.p2.1),[§2](https://arxiv.org/html/2605.06765#S2.p3.1),[§3](https://arxiv.org/html/2605.06765#S3.p6.1)\.
- H\. He, Z\. Shang, C\. Wang, X\. Li, Y\. Gu, H\. Hua, L\. Liu, C\. Yang, J\. Li, P\. Shi,et al\.\(2024\)Emilia: an extensive, multilingual, and diverse speech dataset for large\-scale speech generation\.In2024 IEEE Spoken Language Technology Workshop \(SLT\),pp\. 885–890\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p1.1)\.
- H\. Hu, X\. Zhu, T\. He, D\. Guo, B\. Zhang, X\. Wang, Z\. Guo, Z\. Jiang, H\. Hao, Z\. Guo,et al\.\(2026\)Qwen3\-tts technical report\.arXiv preprint arXiv:2601\.15621\.Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p2.1),[§3](https://arxiv.org/html/2605.06765#S3.p7.1)\.
- A\. Huang, B\. Li, B\. Wang, B\. Wu, C\. Yan, C\. Feng, H\. Wang, H\. Zhou, H\. Wang, J\. Li,et al\.\(2025\)Step\-audio\-aqaa: a fully end\-to\-end expressive large audio language model\.arXiv preprint arXiv:2506\.08967\.Cited by:[§1](https://arxiv.org/html/2605.06765#S1.p1.1),[§5\.3](https://arxiv.org/html/2605.06765#S5.SS3.p2.1)\.
- W\. Huang, E\. Cooper, and T\. Toda \(2024\)Mos\-bench: benchmarking generalization abilities of subjective speech quality assessment models\.arXiv preprint arXiv:2411\.03715\.Cited by:[§5\.2](https://arxiv.org/html/2605.06765#S5.SS2.p1.1)\.
- J\. Hwang, S\. Lee, and S\. Lee \(2025\)HiddenSinger: high\-quality singing voice synthesis via neural audio codec and latent diffusion models\.Neural Networks181,pp\. 106762\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p5.1)\.
- J\. Kim, J\. Kong, and J\. Son \(2021\)Conditional variational autoencoder with adversarial learning for end\-to\-end text\-to\-speech\.InInternational Conference on Machine Learning,pp\. 5530–5540\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p5.1)\.
- KimiTeam, D\. Ding, Z\. Ju, Y\. Leng, S\. Liu, T\. Liu, Z\. Shang, K\. Shen, W\. Song, X\. Tan, H\. Tang, Z\. Wang, C\. Wei, Y\. Xin, X\. Xu, J\. Yu, Y\. Zhang, X\. Zhou, Y\. Charles, J\. Chen, Y\. Chen, Y\. Du, W\. He, Z\. Hu, G\. Lai, Q\. Li, Y\. Liu, W\. Sun, J\. Wang, Y\. Wang, Y\. Wu, Y\. Wu, D\. Yang, H\. Yang, Y\. Yang, Z\. Yang, A\. Yin, R\. Yuan, Y\. Zhang, and Z\. Zhou \(2025\)Kimi\-audio technical report\.External Links:2504\.18425,[Link](https://arxiv.org/abs/2504.18425)Cited by:[§5\.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1)\.
- C\. Li, Z\. Leng, C\. Yan, J\. Shen, H\. Wang, W\. MI, Y\. Fei, X\. Feng, S\. Yan, H\. Wang, L\. Zhan, Y\. Jia, P\. Wu, and H\. Sun \(2023\)ChatHaruhi: reviving anime character in reality via large language model\.External Links:2308\.09597,[Link](https://arxiv.org/abs/2308.09597)Cited by:[§1](https://arxiv.org/html/2605.06765#S1.p2.1),[§2](https://arxiv.org/html/2605.06765#S2.p4.1)\.
- J\. Li, L\. Du, H\. Zhao, B\. Zhang, L\. Wang, B\. Gao, G\. Liu, and Y\. Lin \(2025a\)Infinity instruct: scaling instruction selection and synthesis to enhance language models\.arXiv preprint arXiv:2506\.11116\.Cited by:[§4\.2](https://arxiv.org/html/2605.06765#S4.SS2.p2.1)\.
- T\. Li, J\. Liu, T\. Zhang, Y\. Fang, D\. Pan, M\. Wang, Z\. Liang, Z\. Li, M\. Lin, G\. Dong,et al\.\(2025b\)Baichuan\-audio: a unified framework for end\-to\-end speech interaction\.arXiv preprint arXiv:2502\.17239\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p2.1)\.
- Y\. Lipman, R\. T\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2023\)Flow matching for generative modeling\.In11th International Conference on Learning Representations, ICLR 2023,Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p7.1)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§4\.2](https://arxiv.org/html/2605.06765#S4.SS2.p3.1)\.
- Z\. Long, Y\. Shen, C\. Fu, H\. Gao, L\. Li, P\. Chen, M\. Zhang, H\. Shao, J\. Li, J\. Peng, H\. Cao, K\. Li, R\. Ji, and X\. Sun \(2025a\)VITA\-audio: fast interleaved cross\-modal token generation for efficient large speech\-language model\.ArXivabs/2505\.03739\.External Links:[Link](https://api.semanticscholar.org/CorpusID:278339323)Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p4.1),[Appendix C](https://arxiv.org/html/2605.06765#A3.p1.18),[§2](https://arxiv.org/html/2605.06765#S2.p1.1),[§2](https://arxiv.org/html/2605.06765#S2.p2.1)\.
- Z\. Long, Y\. Shen, C\. Fu, H\. Gao, L\. Li, P\. Chen, M\. Zhang, H\. Shao, J\. Li, J\. Peng,et al\.\(2025b\)VITA\-audio: fast interleaved cross\-modal token generation for efficient large speech\-language model\.arXiv preprint arXiv:2505\.03739\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p3.1),[§4\.1](https://arxiv.org/html/2605.06765#S4.SS1.p1.1)\.
- J\. Lu, J\. Qin, L\. Qiao, Y\. Li, X\. Dai, B\. Ke, J\. He, R\. Qiao, D\. Yin, X\. Sun,et al\.\(2025\)Youtu\-llm: unlocking the native agentic potential for lightweight large language models\.arXiv preprint arXiv:2512\.24618\.Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p2.1)\.
- C\. Ma, W\. Tao, and S\. Y\. Guo \(2025\)C3: a bilingual benchmark for spoken dialogue models exploring challenges in complex conversations\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 22789–22807\.Cited by:[§5\.3](https://arxiv.org/html/2605.06765#S5.SS3.p1.1)\.
- L\. Ma, D\. Guo, K\. Song, Y\. Jiang, S\. Wang, L\. Xue, W\. Xu, H\. Zhao, B\. Zhang, and L\. Xie \(2024\)Wenetspeech4tts: a 12,800\-hour mandarin tts corpus for large speech generation model benchmark\.arXiv preprint arXiv:2406\.05763\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p1.1)\.
- T\. A\. Nguyen, B\. Muller, B\. Yu, M\. R\. Costa\-Jussa, M\. Elbayad, S\. Popuri, C\. Ropers, P\. Duquenne, R\. Algayres, R\. Mavlyutov,et al\.\(2025\)Spirit\-lm: interleaved spoken and written language model\.Transactions of the Association for Computational Linguistics13,pp\. 30–52\.Cited by:[§1](https://arxiv.org/html/2605.06765#S1.p3.1),[§2](https://arxiv.org/html/2605.06765#S2.p2.1)\.
- C\. Pan, D\. Yao, Y\. Zhang, W\. Guo, J\. Lu, Z\. Zhu, and Z\. Zhao \(2025\)Synthetic singers: a review of deep\-learning\-based singing voice synthesis approaches\.InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics,pp\. 396–416\.Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p5.1)\.
- C\. Pan, D\. Yao, Y\. Zhang, W\. Guo, J\. Lu, Z\. Zhu, and Z\. Zhao \(2026\)Synthetic singers: a review of deep\-learning\-based singing voice synthesis approaches\.External Links:2601\.13910,[Link](https://arxiv.org/abs/2601.13910)Cited by:[§1](https://arxiv.org/html/2605.06765#S1.p2.1),[§4\.4](https://arxiv.org/html/2605.06765#S4.SS4.p1.1)\.
- V\. Panayotov, G\. Chen, D\. Povey, and S\. Khudanpur \(2015\)Librispeech: an asr corpus based on public domain audio books\.In2015 IEEE international conference on acoustics, speech and signal processing \(ICASSP\),pp\. 5206–5210\.Cited by:[Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1)\.
- V\. Pratap, Q\. Xu, A\. Sriram, G\. Synnaeve, and R\. Collobert \(2020\)Mls: a large\-scale multilingual dataset for speech research\.arXiv preprint arXiv:2012\.03411\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- Qwen, :, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1),[§5\.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1)\.
- I\. Qwen \(2023\)7B: open foundation and human\-aligned models \(of the state\-of\-the\-arts\)\.URL https://github\. com/QwenLM/Qwen\-7B/blob/main/tech\_memo\. md\.Cited by:[§5\.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1)\.
- A\. Radford, J\. W\. Kim, T\. Xu, G\. Brockman, C\. McLeavey, and I\. Sutskever \(2023\)Robust speech recognition via large\-scale weak supervision\.InInternational conference on machine learning,pp\. 28492–28518\.Cited by:[Appendix D](https://arxiv.org/html/2605.06765#A4.p3.1),[§2](https://arxiv.org/html/2605.06765#S2.p4.1),[§3](https://arxiv.org/html/2605.06765#S3.p8.1)\.
- Y\. Shi, H\. Bu, X\. Xu, S\. Zhang, and M\. Li \(2020\)Aishell\-3: a multi\-speaker mandarin tts corpus and the baselines\.arXiv preprint arXiv:2010\.11567\.Cited by:[Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1)\.
- Silero\-Team \(2024\)Silero vad: pre\-trained enterprise\-grade voice activity detector \(vad\), number detector and language classifier\.GitHub\.Note:[https://github\.com/snakers4/silero\-vad](https://github.com/snakers4/silero-vad)Cited by:[§3](https://arxiv.org/html/2605.06765#S3.p8.1)\.
- H\. Siuzdak, F\. Grötschla, and L\. A\. Lanzendörfer \(2024\)SNAC: multi\-scale neural audio codec\.InAudio Imagination: NeurIPS 2024 Workshop AI\-Driven Speech, Music, and Sound Generation,Cited by:[§2](https://arxiv.org/html/2605.06765#S2.p3.1)\.
- Y\. Tang, L\. Liu, W\. Feng, Y\. Zhao, J\. Han, Y\. Yu, J\. Shi, and Q\. Jin \(2025\)SingMOS\-pro: an comprehensive benchmark for singing quality assessment\.arXiv preprint arXiv:2510\.01812\.Cited by:[§5\.2](https://arxiv.org/html/2605.06765#S5.SS2.p1.1)\.
- TEN Team (2025) TEN turn detection: turn detection for full-duplex dialogue communication. [Link](https://github.com/TEN-framework/ten-turn-detection). Cited by: [§3](https://arxiv.org/html/2605.06765#S3.p8.1).
- Q. Tu, S. Fan, Z. Tian, T. Shen, S. Shang, X. Gao, and R. Yan (2024) CharacterEval: a Chinese benchmark for role-playing conversational agent evaluation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, pp. 11836–11850. Cited by: [§5.1](https://arxiv.org/html/2605.06765#S5.SS1.p1.1).
- C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux (2021) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390. Cited by: [Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1).
- C. Wang, T. Peng, W. Yang, Y. Bai, G. Wang, J. Lin, L. Jia, L. Wu, J. Wang, C. Zong, et al. (2025a) OpenS2S: advancing fully open-source end-to-end empathetic large speech language model. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 906–917. Cited by: [§4.2](https://arxiv.org/html/2605.06765#S4.SS2.p3.1).
- H. Wang, S. Zheng, Y. Chen, L. Cheng, and Q. Chen (2023) CAM++: a fast and efficient network for speaker verification using context-aware masking. arXiv preprint arXiv:2303.00332. Cited by: [§3](https://arxiv.org/html/2605.06765#S3.p7.1).
- W. Wang, Y. Song, and S. Jha (2024a) GLOBE: a high-quality English corpus with global accents for zero-shot speaker adaptive text-to-speech. arXiv:2406.14875, [Link](https://arxiv.org/abs/2406.14875). Cited by: [Appendix B](https://arxiv.org/html/2605.06765#A2.p1.1).
- X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng, et al. (2025b) Spark-TTS: an efficient LLM-based text-to-speech model with single-stream decoupled speech tokens. arXiv preprint arXiv:2503.01710. Cited by: [§2](https://arxiv.org/html/2605.06765#S2.p3.1).
- X. Wang, Y. Li, C. Fu, L. Xie, K. Li, X. Sun, and L. Ma (2024b) Freeze-Omni: a smart and low latency speech-to-speech dialogue model with frozen LLM. arXiv preprint arXiv:2411.00774. Cited by: [Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1).
- Z. M. Wang, Z. Peng, H. Que, J. Liu, W. Zhou, Y. Wu, H. Guo, R. Gan, Z. Ni, J. Yang, M. Zhang, Z. Zhang, W. Ouyang, K. Xu, S. W. Huang, J. Fu, and J. Peng (2024c) RoleLLM: benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv:2310.00746, [Link](https://arxiv.org/abs/2310.00746). Cited by: [§1](https://arxiv.org/html/2605.06765#S1.p2.1).
- Y. Wu, J. Shi, Y. Tang, S. Yang, Q. Jin, et al. (2024) TokSing: singing voice synthesis based on discrete tokens. arXiv preprint arXiv:2406.08416. Cited by: [§2](https://arxiv.org/html/2605.06765#S2.p5.1).
- Z. Xie and C. Wu (2024) Mini-Omni: language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725. Cited by: [Appendix B](https://arxiv.org/html/2605.06765#A2.p3.1), [§1](https://arxiv.org/html/2605.06765#S1.p3.1), [§2](https://arxiv.org/html/2605.06765#S2.p1.1), [§2](https://arxiv.org/html/2605.06765#S2.p2.1), [§3](https://arxiv.org/html/2605.06765#S3.p6.1), [§4.2](https://arxiv.org/html/2605.06765#S4.SS2.p2.1).
- J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, et al. (2025a) Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215. Cited by: [§2](https://arxiv.org/html/2605.06765#S2.p1.1), [§5.3](https://arxiv.org/html/2605.06765#S5.SS3.p2.1).
- J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin (2025b) Qwen3-Omni technical report. arXiv:2509.17765, [Link](https://arxiv.org/abs/2509.17765). Cited by: [Appendix B](https://arxiv.org/html/2605.06765#A2.p3.1), [Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1), [§2](https://arxiv.org/html/2605.06765#S2.p1.1).
- XVERSE Team (2023) XVERSE-7B. URL: https://github.com/xverse-ai/XVERSE-7B/blob/main/README.md. Cited by: [§5.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1).
- R. Yan, X. Li, W. Chen, Z. Niu, C. Yang, Z. Ma, K. Yu, and X. Chen (2025) URO-Bench: a comprehensive benchmark for end-to-end spoken dialogue models. arXiv preprint arXiv:2502.17810. Cited by: [Appendix F](https://arxiv.org/html/2605.06765#A6.p2.1), [§5.3](https://arxiv.org/html/2605.06765#S5.SS3.p1.1).
- A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, et al. (2023) Baichuan 2: open large-scale language models. arXiv preprint arXiv:2309.10305. Cited by: [§5.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§5.1](https://arxiv.org/html/2605.06765#S5.SS1.p2.1).
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024a) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: [§2](https://arxiv.org/html/2605.06765#S2.p4.1).
- D. Yang, S. Liu, R. Huang, C. Weng, and H. Meng (2024b) InstructTTS: modelling expressive TTS in discrete latent space with natural language style prompt. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, pp. 2913–2925. Cited by: [§4.3](https://arxiv.org/html/2605.06765#S4.SS3.p1.1).
- Z. Ye, P. Sun, J. Lei, H. Lin, X. Tan, Z. Dai, Q. Kong, J. Chen, J. Pan, Q. Liu, et al. (2025) Codec does matter: exploring the semantic shortcoming of codec for audio language model. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 25697–25705. Cited by: [§2](https://arxiv.org/html/2605.06765#S2.p3.1).
- H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu (2019) LibriTTS: a corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882. Cited by: [Appendix B](https://arxiv.org/html/2605.06765#A2.p1.1), [Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1).
- A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y. Dong, and J. Tang (2024a) GLM-4-Voice: towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612. Cited by: [Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1), [Appendix F](https://arxiv.org/html/2605.06765#A6.p1.1), [Appendix F](https://arxiv.org/html/2605.06765#A6.p2.1), [§2](https://arxiv.org/html/2605.06765#S2.p1.1), [§2](https://arxiv.org/html/2605.06765#S2.p2.1), [§2](https://arxiv.org/html/2605.06765#S2.p3.1), [§3](https://arxiv.org/html/2605.06765#S3.p6.1), [§5.3](https://arxiv.org/html/2605.06765#S5.SS3.p2.1).
- A. Zeng, Z. Du, M. Liu, L. Zhang, S. Jiang, Y. Dong, and J. Tang (2024b) Scaling speech-text pre-training with synthetic interleaved data. arXiv preprint arXiv:2411.17607. Cited by: [§1](https://arxiv.org/html/2605.06765#S1.p3.1).
- B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng, et al. (2022a) WenetSpeech: a 10000+ hours multi-domain Mandarin corpus for speech recognition. In ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6182–6186. Cited by: [Appendix B](https://arxiv.org/html/2605.06765#A2.p2.1), [Appendix D](https://arxiv.org/html/2605.06765#A4.p1.1).
- H. Zhang, R. Luo, X. Liu, Y. Wu, T. Lin, P. Zeng, Q. Qu, F. Fang, M. Yang, L. Gao, J. Song, F. Huang, and Y. Li (2025a) OmniCharacter: towards immersive role-playing agents with seamless speech-language personality interaction. arXiv:2505.20277, [Link](https://arxiv.org/abs/2505.20277). Cited by: [§1](https://arxiv.org/html/2605.06765#S1.p2.1), [§2](https://arxiv.org/html/2605.06765#S2.p4.1).
- Xiaomi LLM-Core Team: D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuang, X. Zhang, X. Song, Y. Yan, Y. He, Cici, B. Shen, C. Zhu, C. Ma, C. Chen, H. Chen, J. Li, L. Li, M. Zhu, P. Li, Q. Wang, S. Deng, W. Xiong, W. Huang, W. Yang, Y. Jiang, Y. Yang, Y. Tian, Y. Ma, Y. Yu, Z. Zhang, Z. Yue, B. Xiao, B. Xia, B. Gao, B. Ye, C. Cai, C. Liu, C. He, C. Li, D. Zhu, D. Zhang, F. Shi, G. Wang, H. Zhang, H. Lv, H. Li, H. Tian, H. Qu, H. Xu, H. Zhang, H. Liu, J. Duo, J. Zuo, J. Wei, J. Xiao, J. Dong, J. Shi, J. Hu, K. Bao, K. Zhou, L. Zhang, M. Chen, N. Chen, P. Zhang, Q. Chen, Q. Wang, R. Li, S. Liu, S. Wang, S. Li, S. Yu, S. Cao, S. Chen, S. Gu, W. Wang, W. Ma, X. Deng, X. Yong, X. Zhang, X. Wang, Y. Song, Y. Zhao, Y. Zhao, Y. Gao, Y. Cheng, Y. Tu, Y. Wang, Z. Huang, Z. Tang, Z. Lin, Z. Song, Z. Xu, Z. Zheng, and Z. Jiang (2025b) MiMo-Audio: audio language models are few-shot learners. arXiv abs/2512.23808, [Link](https://api.semanticscholar.org/CorpusID:284351195). Cited by: [§1](https://arxiv.org/html/2605.06765#S1.p1.1), [§2](https://arxiv.org/html/2605.06765#S2.p1.1), [§2](https://arxiv.org/html/2605.06765#S2.p2.1).
- Y. Zhang, J. Cong, H. Xue, L. Xie, P. Zhu, and M. Bi (2022b) VISinger: variational inference with adversarial learning for end-to-end singing voice synthesis. In ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7237–7241. Cited by: [§2](https://arxiv.org/html/2605.06765#S2.p5.1).
- S. Zhou, Y. Zhou, Y. He, X. Zhou, J. Wang, W. Deng, and J. Shu (2025) IndexTTS2: a breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech. arXiv preprint arXiv:2506.21619. Cited by: [§3](https://arxiv.org/html/2605.06765#S3.p7.1), [§4.3](https://arxiv.org/html/2605.06765#S4.SS3.p1.1).
## Appendix B Pretraining Data
**TTS Data.** We leverage a massive dataset totaling approximately 867K hours for Text-to-Speech (TTS) training. We use a TTS data pipeline (audio denoising, speaker diarization, and ASR transcription) to curate 762K hours of high-quality corpus from diverse sources, including audiobooks, podcasts, children's stories, and traditional Chinese performing arts. Concurrently, we integrate 105K hours of open-source TTS data, mainly consisting of the WenetSpeech4TTS (Ma et al., [2024](https://arxiv.org/html/2605.06765#bib.bib116)), LibriTTS (Zen et al., [2019](https://arxiv.org/html/2605.06765#bib.bib117)), GLOBEv2 (Wang et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib118)), and Emilia (He et al., [2024](https://arxiv.org/html/2605.06765#bib.bib119)) datasets. Combining large-scale proprietary recordings with diverse public corpora provides broad prosodic and multi-speaker coverage.
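To make the pipeline concrete, a skeleton of the curation pass might look as follows. This is our sketch, not the released code: the `denoise` and `diarize` stages are stubs standing in for whichever enhancement and diarization models are plugged in, and only the Whisper ASR call is a real API.

```python
# Skeleton of the denoise -> diarize -> transcribe curation pass described
# above. The first two stages are stubs to be backed by real enhancement /
# diarization models; only the Whisper ASR call is a concrete API.
import whisper  # pip install openai-whisper

def denoise(wav_path: str) -> str:
    raise NotImplementedError("plug in a speech-enhancement model")

def diarize(wav_path: str) -> list[tuple[str, float, float]]:
    raise NotImplementedError("plug in a diarizer returning (speaker, start, end)")

def curate(raw_wav_path: str, asr_model) -> list[dict]:
    clean_path = denoise(raw_wav_path)
    samples = []
    for speaker, start, end in diarize(clean_path):
        # A real pipeline would slice the audio to this speaker turn before
        # transcribing; we transcribe the whole cleaned file for brevity.
        text = asr_model.transcribe(clean_path)["text"]
        samples.append({"speaker": speaker, "start": start, "end": end, "text": text})
    return samples

# Usage: curate("raw.wav", whisper.load_model("large-v3"))
```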
**ASR Data.** We aggregate approximately 100K hours of open-source Automatic Speech Recognition (ASR) data, including WenetSpeech (Zhang et al., [2022a](https://arxiv.org/html/2605.06765#bib.bib120)), LibriSpeech (Panayotov et al., [2015](https://arxiv.org/html/2605.06765#bib.bib121)), MLS (Pratap et al., [2020](https://arxiv.org/html/2605.06765#bib.bib122)), Common Voice (Ardila et al., [2020](https://arxiv.org/html/2605.06765#bib.bib123)), SLR68, GigaSpeech (Chen et al., [2021](https://arxiv.org/html/2605.06765#bib.bib124)), People's Speech (Galvez et al., [2021](https://arxiv.org/html/2605.06765#bib.bib125)), VoxPopuli (Wang et al., [2021](https://arxiv.org/html/2605.06765#bib.bib126)), and the AISHELL series (Bu et al., [2017](https://arxiv.org/html/2605.06765#bib.bib127); Du et al., [2018](https://arxiv.org/html/2605.06765#bib.bib128); Shi et al., [2020](https://arxiv.org/html/2605.06765#bib.bib129); Fu et al., [2021](https://arxiv.org/html/2605.06765#bib.bib130)).
**SQA Data.** Speech-to-text understanding takes spoken questions as input and generates textual responses. This task is ignored in native SLMs (Xie and Wu, [2024](https://arxiv.org/html/2605.06765#bib.bib43); Gao et al., [2025](https://arxiv.org/html/2605.06765#bib.bib99); Long et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib100)), but it has been shown to be crucial for preserving the intelligence of the backbone LLM in recent modular SLMs (Chen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib48); Xu et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib103)). Accordingly, we build a Speech Question Answering (SQA) dataset selected from in-house QA text pairs. The SQA dataset covers approximately 18K hours of question speech, spanning general knowledge, commonsense reasoning, and reading comprehension.
**Text Data.** Following VITA-Audio (Long et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib30)), we collect open-source text corpora of 11B tokens covering conversations on knowledge answering, understanding, reasoning, and long-context text.
## Appendix C Training Details
We train VITA-QinYu using a standard align-pretrain-SFT pipeline. Stage 0 performs modality alignment: we freeze the audio encoder and the backbone LLM and train only the MLP-based audio adapter on 10% of the pretraining data, aligning the audio encoder's output with the LLM's input space. The adapter is trained for 500 steps with an effective batch size of 1.28M tokens and a learning rate of 1e-3. Stage 1 is pretraining: we train both the audio adapter and the LLM on our full open-source and in-house ASR, TTS, SQA, and text pretraining data, as described in Sec. [4.1](https://arxiv.org/html/2605.06765#S4.SS1). The model is trained for 8K steps with an effective batch size of 2.56M tokens and a learning rate of 6e-5. Stage 2 is supervised finetuning: we train both the audio adapter and the LLM mainly on spoken dialogue data, including natural conversation, role-playing, and singing data, together with a small portion of pretraining data from the previous stage, for 8K steps with an effective batch size of 1.28M tokens and a learning rate of 6e-5. Following VITA-Audio (Long et al., [2025a](https://arxiv.org/html/2605.06765#bib.bib30)), all training data are packed into sequences with a fixed length of 10K tokens to maximize GPU utilization.
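For concreteness, the schedule above can be collapsed into a small configuration. The field names below are our own shorthand for the hyperparameters stated in the text, not the authors' training code:

```python
# Hedged summary of the three-stage schedule as a plain config dict;
# every number is taken from the text, the key names are illustrative.
TRAINING_STAGES = {
    "stage0_align": {                    # modality alignment: adapter only
        "trainable": ["audio_adapter"],  # encoder and backbone LLM frozen
        "data_fraction": 0.10,           # 10% of the pretraining data
        "steps": 500,
        "batch_tokens": 1_280_000,       # 1.28M-token effective batch
        "lr": 1e-3,
    },
    "stage1_pretrain": {                 # adapter + LLM on ASR/TTS/SQA/text
        "trainable": ["audio_adapter", "llm"],
        "steps": 8_000,
        "batch_tokens": 2_560_000,
        "lr": 6e-5,
    },
    "stage2_sft": {                      # spoken-dialogue SFT + replay
        "trainable": ["audio_adapter", "llm"],
        "steps": 8_000,
        "batch_tokens": 1_280_000,
        "lr": 6e-5,
    },
}
SEQUENCE_PACK_LENGTH = 10_000            # fixed 10K-token packed sequences
```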
## Appendix D ASR and TTS
Table 6: Results on Automatic Speech Recognition (ASR) benchmarks. The best results are highlighted in bold, and the second-best are underlined.

**ASR.** We evaluate the ASR performance of VITA-QinYu on the WenetSpeech (Zhang et al., [2022a](https://arxiv.org/html/2605.06765#bib.bib120)) and AISHELL (Bu et al., [2017](https://arxiv.org/html/2605.06765#bib.bib127)) datasets for Chinese and on LibriSpeech (Panayotov et al., [2015](https://arxiv.org/html/2605.06765#bib.bib121)) for English. We compute character error rate (CER) for Chinese and word error rate (WER) for English. The results are shown in Table [6](https://arxiv.org/html/2605.06765#A4.T6). The performance of Qwen3-Omni (Xu et al., [2025b](https://arxiv.org/html/2605.06765#bib.bib103)) with 30B parameters is provided as a top-line reference. We observe that VITA-QinYu-8B achieves lower error rates than VITA-QinYu-4B, possibly due to stronger language capabilities; both variants achieve lower error rates than GLM-4-Voice (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47)) and Freeze-Omni (Wang et al., [2024b](https://arxiv.org/html/2605.06765#bib.bib45)) on all three datasets, and perform comparably to Qwen2.5-Omni (Qwen et al., [2025](https://arxiv.org/html/2605.06765#bib.bib138)).
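A minimal sketch of this scoring, assuming the open-source `jiwer` package (the paper does not name its scoring tool):

```python
# CER for Chinese, WER for English, as described above. The choice of
# jiwer is our assumption; any edit-distance scorer would do.
import jiwer

def asr_error_rate(reference: str, hypothesis: str, lang: str) -> float:
    if lang == "zh":
        # Character error rate: edit distance over character sequences.
        return jiwer.cer(reference, hypothesis)
    # Word error rate over whitespace-separated tokens for English.
    return jiwer.wer(reference, hypothesis)

print(asr_error_rate("hello world", "hello word", lang="en"))  # 0.5
```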
Table 7: Results on the Seed-TTS Text-to-Speech (TTS) benchmark. The best results are highlighted in bold, and the second-best are underlined.

**TTS.** We evaluate the TTS performance of VITA-QinYu on the Seed-TTS (Anastassiou et al., [2024](https://arxiv.org/html/2605.06765#bib.bib4)) benchmark. The generated English speech is transcribed using Whisper-Large-V3 (Radford et al., [2023](https://arxiv.org/html/2605.06765#bib.bib59)) and the Chinese speech using Paraformer (Gao et al., [2022](https://arxiv.org/html/2605.06765#bib.bib16)). CER (Chinese) and WER (English) are computed by comparing the resulting transcriptions against the ground-truth text. We observe that VITA-QinYu-8B and VITA-QinYu-4B perform comparably, with the 4B variant slightly better in English and the 8B variant slightly better in Chinese. In particular, VITA-QinYu-8B produces speech more faithfully aligned with the input text than GLM-4-Voice and Qwen2.5-Omni in Chinese, performing comparably to, and occasionally surpassing, specialized TTS models such as Seed-TTS (Anastassiou et al., [2024](https://arxiv.org/html/2605.06765#bib.bib4)) and CosyVoice2 (Du et al., [2024b](https://arxiv.org/html/2605.06765#bib.bib6)).
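This round-trip evaluation can be sketched in a few lines. Whisper large-v3 is the transcriber named in the text; the use of `jiwer` for WER is our assumption, and the sketch covers only the English branch:

```python
# Round-trip check for the TTS evaluation above: transcribe the generated
# audio with Whisper large-v3, then score WER against the input text.
# (The paper uses Paraformer and CER for the Chinese branch.)
import jiwer
import whisper

stt = whisper.load_model("large-v3")

def tts_round_trip_wer(input_text: str, generated_wav: str) -> float:
    transcript = stt.transcribe(generated_wav)["text"]
    # Lowercase both sides so casing differences do not count as errors.
    return jiwer.wer(input_text.lower(), transcript.lower())
```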
## Appendix E Detailed Role-playing Results
Table 8: Detailed objective evaluation results on generated text responses for role-playing. The text responses are evaluated across character consistency (Consistency), conversational ability (Conversation), and role-playing attractiveness (Attractiveness). Character consistency is evaluated on Knowledge-Exposure (KE), Knowledge-Accuracy (KA), Knowledge-Hallucination (KH), Persona-Behavior (PB), and Persona-Utterance (PU); conversational ability on Fluency (Flu.), Coherency (Coh.), and Consistency (Cons.); attractiveness on Human-Like (HL), Communication Skills (CS), Expression Diversity (ED), and Empathy (Emp.). The three Avg. columns are the per-dimension averages.

| Model | KE | KA | KH | PB | PU | Avg. (Cons.) | Flu. | Coh. | Cons. | Avg. (Conv.) | HL | CS | ED | Emp. | Avg. (Attr.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B | 1.83 | 3.03 | 2.70 | 1.33 | 2.59 | 2.41 | 3.53 | 3.88 | 3.69 | 3.70 | 3.10 | 2.97 | 1.36 | 3.26 | 2.67 |
| Qwen3-8B | 1.83 | 3.12 | 2.74 | 1.58 | 2.64 | 2.52 | 3.59 | 3.91 | 3.73 | 3.74 | 3.21 | 2.99 | 1.53 | 3.33 | 2.76 |
| ChatGLM3-6B | 1.87 | 2.97 | 2.57 | 1.29 | 2.44 | 2.32 | 3.32 | 3.70 | 3.39 | 3.47 | 2.84 | 2.93 | 1.33 | 3.14 | 2.56 |
| Qwen-7B-Chat | 1.84 | 2.99 | 2.62 | 1.35 | 2.50 | 2.37 | 3.46 | 3.83 | 3.58 | 3.62 | 2.94 | 2.96 | 1.38 | 3.21 | 2.62 |
| XVERSE-7B-Chat | 1.83 | 3.05 | 2.72 | 1.32 | 2.60 | 2.42 | 3.53 | 3.88 | 3.70 | 3.70 | 3.11 | 2.98 | 1.35 | 3.28 | 2.68 |
| GLM-9B-Chat | 1.75 | 3.10 | 2.73 | 1.55 | 2.69 | 2.52 | 3.61 | 3.91 | 3.80 | 3.77 | 3.34 | 2.87 | 1.52 | 3.32 | 2.76 |
| Qwen2.5-7B-Instruct | 1.86 | 3.08 | 2.75 | 1.36 | 2.63 | 2.45 | 3.57 | 3.92 | 3.74 | 3.74 | 3.13 | 3.03 | 1.38 | 3.32 | 2.71 |
| Qwen2.5-Omni | 1.88 | 3.38 | 2.74 | 1.22 | 2.71 | 2.51 | 3.56 | 3.81 | 3.66 | 3.68 | 3.24 | 3.06 | 1.26 | 3.46 | 2.76 |
| Kimi-Audio | 1.87 | 3.41 | 2.75 | 1.24 | 2.69 | 2.52 | 3.62 | 3.89 | 3.79 | 3.77 | 3.23 | 3.04 | 1.26 | 3.50 | 2.76 |
| VITA-QinYu-4B | 1.81 | 3.08 | 2.71 | 1.35 | 2.61 | 2.44 | 3.54 | 3.87 | 3.68 | 3.70 | 3.15 | 2.98 | 1.35 | 3.30 | 2.69 |
| VITA-QinYu-8B | 1.75 | 3.12 | 2.73 | 1.53 | 2.69 | 2.52 | 3.62 | 3.91 | 3.79 | 3.77 | 3.35 | 2.88 | 1.51 | 3.32 | 2.77 |

Detailed role-playing results are shown in Table [8](https://arxiv.org/html/2605.06765#A5.T8).
## Appendix F Multi- vs. Single-Codebook Speech Tokenizers
Table 9: MSE and DTW distance for the XY-Tokenizer, GLM-4-Voice Tokenizer, and CosyVoice2 Tokenizer.

**Pitch Contour Comparison.** In our preliminary experiments, we compare the multi-codebook XY-Tokenizer with the single-codebook CosyVoice2 Tokenizer (Du et al., [2024b](https://arxiv.org/html/2605.06765#bib.bib6)) and GLM-4-Voice Tokenizer (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47)) for singing voice reconstruction. XY-Tokenizer operates at 100 Hz, whereas CosyVoice2 and GLM-4-Voice use lower token rates of 25 Hz and 12.5 Hz, respectively. The CosyVoice2 decoder additionally requires speaker embeddings to recover speaker identity, while GLM-4-Voice neither requires nor preserves speaker identity. We randomly sample 100 songs from the singing data and reconstruct them with each tokenizer. The normalized pitch contours of the reconstructed audio are extracted and compared with those of the original recordings, using mean squared error (MSE) and dynamic time warping (DTW) distance as quantitative measures. The results, shown in Table [9](https://arxiv.org/html/2605.06765#A6.T9), indicate that XY-Tokenizer achieves the best reconstruction with the lowest MSE and DTW distance. Consistent with this, qualitative listening indicates that XY-Tokenizer preserves the original melody, while the other two methods miss most melodic variations.
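One plausible implementation of this comparison, assuming librosa's pYIN for F0 extraction and its built-in DTW (the paper does not specify its pitch extractor):

```python
# Sketch of the pitch-contour comparison above: extract F0 with pYIN,
# z-normalize per clip, then score MSE and a DTW distance. The library
# choices and the voiced-frame filtering are our assumptions.
import librosa
import numpy as np

def pitch_contour(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]                       # keep voiced frames only
    return (f0 - f0.mean()) / (f0.std() + 1e-8)  # normalize the contour

def contour_distances(original: str, reconstructed: str) -> tuple[float, float]:
    a, b = pitch_contour(original), pitch_contour(reconstructed)
    n = min(len(a), len(b))
    mse = float(np.mean((a[:n] - b[:n]) ** 2))           # frame-wise MSE
    cost, _ = librosa.sequence.dtw(a[None, :], b[None, :])  # 1-D DTW
    return mse, float(cost[-1, -1])
```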
Table 10: UTMOS results for English and Chinese on the URO-Bench.

**UTMOS Comparison.** We then compare VITA-QinYu using the XY-Tokenizer (Gong et al., [2025](https://arxiv.org/html/2605.06765#bib.bib17)) with VITA-Audio using the GLM-4-Voice Tokenizer (Zeng et al., [2024a](https://arxiv.org/html/2605.06765#bib.bib47)). Speech responses are generated on the URO-Bench (Yan et al., [2025](https://arxiv.org/html/2605.06765#bib.bib22)), and UTMOS (Baba et al., [2024](https://arxiv.org/html/2605.06765#bib.bib2)), a reference-free metric of speech naturalness, is computed for each response. The results are summarized in Table [10](https://arxiv.org/html/2605.06765#A6.T10). We observe that VITA-QinYu-4B with the XY-Tokenizer achieves higher UTMOS scores in both English and Chinese, despite its smaller model scale.
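UTMOS can be scored with an off-the-shelf predictor. The snippet below assumes the community SpeechMOS Torch Hub release; the paper does not say which UTMOS implementation it used:

```python
# Hedged sketch: scoring speech naturalness with a UTMOS predictor.
# tarepan/SpeechMOS is a community package and an assumption on our part.
import librosa
import torch

predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong",
                           trust_repo=True)

def utmos_score(wav_path: str) -> float:
    wave, sr = librosa.load(wav_path, sr=None, mono=True)
    with torch.inference_mode():
        score = predictor(torch.from_numpy(wave).unsqueeze(0), sr)
    return float(score)  # 1-5 scale; higher means more natural
```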
## Appendix G Ablation on Speaker Pretraining
Table 11: Objective evaluation results for role-playing in both text and speech. Text responses are assessed in terms of character consistency (Consistency), conversational ability (Conversation), and role-playing attractiveness (Attractiveness), while speech responses are evaluated by speaker similarity with respect to ground-truth speech. Overall performance is reported as the average score (Avg.) across these four dimensions. The best results are highlighted in bold, and the second-best are underlined.

We conduct an ablation study to evaluate the effect of speaker pretraining. Specifically, in Stage 1 (pretraining), we remove both agent and user embedding information from the input and then train Stage 2 from the resulting Stage 1 checkpoint, yielding VITA-QinYu-8B-nospk. We compare its role-playing responses with those of VITA-QinYu-8B. The results are presented in Table [11](https://arxiv.org/html/2605.06765#A7.T11). VITA-QinYu-8B-nospk underperforms VITA-QinYu-8B in both subjective and objective evaluations, indicating that explicitly injecting speaker information during pretraining effectively decouples content and timbre. This facilitates better modeling of both text and speech, thus improving role-playing performance.