Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean

arXiv cs.CL 06/26/26, 04:00 AM Papers
text-to-speech low-resource lora fine-tuning khmer korean voxcpm2
Summary
This paper investigates LoRA fine-tuning of the VoxCPM2 TTS model to improve quality for low-resource languages like Khmer, while showing no gain for Korean which the base model already handles well. The adapter yields significant MOS improvement for Khmer with minimal parameter training.
arXiv:2606.26618v1
# Closing the Quality Gap in Low-Resource Text-to-Speech: LoRA Fine-Tuning of VoxCPM2 for Khmer and Korean
Source: [https://arxiv.org/html/2606.26618](https://arxiv.org/html/2606.26618)
Phannet Pov1,2, Sovandara Chhoun1, Hyun Woo Park1, Wan\-Sup Cho3, Saksonita Khoeurn3,4,\*

###### Abstract

Large pretrained text\-to\-speech \(TTS\) models sound almost human for well\-resourced languages, but much worse for languages that are rare in their training data\. We study this quality gap for Khmer and Korean using VoxCPM2, a 2\.4B\-parameter, tokenizer\-free TTS model that joins a MiniCPM\-4 language\-model backbone with a flow\-matching diffusion decoder\. We build one shared, language\-tagged corpus of about 26 hours and adapt VoxCPM2 with a single Low\-Rank Adaptation \(LoRA\) adapter, trained on both languages at once and added to both the language model and the decoder\. The adapter is zero\-initialized, so training starts exactly at the original \(zero\-shot\) model\. In native\-speaker listening tests, the Khmer Mean Opinion Score \(MOS\) rises from 3\.85 to 4\.23 with the best adapter \(rank 64\), a highly significant gain \(paired Wilcoxon test, $p<0.001$\), while training only 0\.19 to 3\.03 percent of the parameters\. The automatic loss and the human ratings, however, disagree on the best rank: validation loss is lowest at rank 128, yet MOS peaks at rank 64\. The same adapter brings no gain for Korean, a language the base model already handles well, and at a high rank it even degrades quality\. Adaptation therefore helps mainly where the base model is genuinely weak\.

## I Introduction

Neural text\-to\-speech \(TTS\) has advanced rapidly\. Early systems such as Tacotron 2 \[[1](https://arxiv.org/html/2606.26618#bib.bib1)\] and FastSpeech 2 \[[2](https://arxiv.org/html/2606.26618#bib.bib2)\] predicted intermediate acoustic features, whereas newer end\-to\-end and large generative systems \[[3](https://arxiv.org/html/2606.26618#bib.bib3),[4](https://arxiv.org/html/2606.26618#bib.bib4)\] now approach human quality for well\-resourced languages such as English and Mandarin, largely on the strength of large\-scale pretraining\. Foundation models such as VoxCPM \[[5](https://arxiv.org/html/2606.26618#bib.bib5)\] and its successor VoxCPM2 \[[6](https://arxiv.org/html/2606.26618#bib.bib6)\] are trained on millions of hours of multilingual speech and synthesize highly natural, context\-aware speech\.

Even so, the output is not yet on par with a native speaker\. Zero\-shot synthesis from the base model is already usable, but it still mispronounces words, places stress and intonation awkwardly, and retains a synthetic quality that listeners notice\. The shortfall widens for under\-resourced languages that appear only rarely in pretraining \[[7](https://arxiv.org/html/2606.26618#bib.bib7),[8](https://arxiv.org/html/2606.26618#bib.bib8)\], and it is precisely this regime we target\. Our primary case is Khmer, the official language of Cambodia and a genuinely low\-resource language whose orthography, unlike English or Korean, places no spaces between words \[[9](https://arxiv.org/html/2606.26618#bib.bib9)\]\. To distinguish genuine adaptation from a gain that any language would enjoy, we pair Khmer with Korean, which the base model already handles well\.

The conventional remedy is full fine\-tuning, which updates every parameter\. It can restore quality, but at a steep price: substantial compute and storage, a separate multi\-billion\-parameter checkpoint for each language, and the risk that the model forgets what it already knew\. Parameter\-efficient fine\-tuning \(PEFT\) sidesteps this\. Low\-Rank Adaptation \(LoRA\) \[[10](https://arxiv.org/html/2606.26618#bib.bib10)\], in particular, freezes the pretrained weights and trains only small low\-rank matrices, so just a tiny fraction of the parameters change\. LoRA is well established for large language models \[[10](https://arxiv.org/html/2606.26618#bib.bib10),[11](https://arxiv.org/html/2606.26618#bib.bib11)\], but for speech two questions remain open: how far it can close the quality gap in low\-resource TTS, and whether a *single shared* adapter can serve several very different languages at once\.

This paper investigates whether a small LoRA adapter on VoxCPM2 can close this gap, using Khmer and Korean, two languages that the base model covers to different degrees\. Our contributions are:

- *One shared adapter for two languages and two modules\.* We train a *single* LoRA adapter on Khmer and Korean together, and we add it to *both* the MiniCPM\-4 language model and the flow\-matching decoder\. One small adapter \(0\.19 to 3\.03% of the parameters\) then serves both scripts, with no separate model per language\. As far as we know, this is the first parameter\-efficient adaptation of a foundation TTS model for Khmer\.
- *Adaptation helps only where the base model is weak\.* We measure native\-speaker MOS and test for significance\. The same adapter gives a large, highly significant gain for Khmer, which the base model covers poorly \(overall MOS from 3\.85 to 4\.23, an improvement of 0\.38 points, $p<0.001$\), but no significant gain for Korean, which it already covers well \(the best rank is only 0\.11 points higher, $p=0.49$\); a high rank even makes Korean worse\. The adapter thus fills a genuine deficit rather than helping every language uniformly\.
- *Training loss does not predict the best rank\.* We test ranks 8, 16, 32, 64, and 128\. The validation loss is lowest at rank 128, but Khmer MOS \(naturalness, prosody, pronunciation\) is highest at rank 64 and then drops\. The loss therefore overstates the value of extra capacity\. The rank should be chosen by listening tests; the small rank\-8 adapter already recovers most of the reduction in validation loss\.
- *Adaptation as a simple probe of what the model already knows\.* Because the adapter starts at the exact zero\-shot model, how much it helps \(and whether it helps at all\) shows how much of a language the base model already learned\. The useful rank grows with this gap\. This gives clear advice \(rank 64 for Khmer; do not fine\-tune Korean for overall quality, and avoid rank 64 or higher for Korean\) and shows that one global rank is wrong when languages differ\.

## II Related Work

### II\-A Neural Text\-to\-Speech

Modern TTS began with two\-stage neural pipelines: Tacotron 2 \[[1](https://arxiv.org/html/2606.26618#bib.bib1)\] predicts mel\-spectrograms autoregressively and pairs them with a neural vocoder, while FastSpeech 2 \[[2](https://arxiv.org/html/2606.26618#bib.bib2)\] introduces non\-autoregressive synthesis with explicit duration, pitch, and energy modeling\. Fully end\-to\-end systems such as VITS \[[3](https://arxiv.org/html/2606.26618#bib.bib3)\] combine variational inference with adversarial training and normalizing flows\. Large generative models then recast TTS as a conditional language\-modeling or diffusion problem: VALL\-E \[[4](https://arxiv.org/html/2606.26618#bib.bib4)\] frames zero\-shot TTS as neural\-codec language modeling\. VoxCPM \[[5](https://arxiv.org/html/2606.26618#bib.bib5)\] departs from discrete\-codec approaches with a *tokenizer\-free* design that models continuous acoustic representations, and VoxCPM2 \[[6](https://arxiv.org/html/2606.26618#bib.bib6)\] scales this to a 2\.4B\-parameter model that pairs a MiniCPM\-4 \[[12](https://arxiv.org/html/2606.26618#bib.bib12)\] backbone with a flow\-matching \[[13](https://arxiv.org/html/2606.26618#bib.bib13)\] diffusion decoder\. We adopt VoxCPM2 as our base model\.

### II\-B Multilingual and Low\-Resource TTS

Extending speech technology to under\-resourced languages is a long\-standing challenge\. Large\-scale corpus efforts, such as the Massively Multilingual Speech project \[[7](https://arxiv.org/html/2606.26618#bib.bib7)\] and Common Voice \[[14](https://arxiv.org/html/2606.26618#bib.bib14)\], have broadened language coverage, while zero\-shot and cross\-lingual systems such as YourTTS \[[15](https://arxiv.org/html/2606.26618#bib.bib15)\] and XTTS \[[16](https://arxiv.org/html/2606.26618#bib.bib16)\] transfer to new speakers and languages from limited data\. Nevertheless, languages such as Khmer remain data\-scarce, and quality for nominally supported low\-resource languages typically lags that of high\-resource languages, motivating targeted adaptation\.

### II\-C TTS Adaptation

A body of work adapts pretrained TTS models to new speakers, styles, or languages from limited data \[[18](https://arxiv.org/html/2606.26618#bib.bib18)\]\. AdaSpeech \[[17](https://arxiv.org/html/2606.26618#bib.bib17)\], for example, adapts a model while updating only a small set of parameters, foreshadowing parameter\-efficient approaches\. These methods establish that high\-quality adaptation need not retrain the full model\. Our work extends this line to lightweight, jointly multilingual adaptation of a 2\.4B\-parameter foundation TTS model, where a single shared low\-rank adapter serves two typologically distinct languages that the base model covers to different degrees\.

### II\-D Parameter\-Efficient Fine\-Tuning

Parameter\-efficient fine\-tuning adapts large pretrained models by updating only a small subset of their parameters\. LoRA \[[10](https://arxiv.org/html/2606.26618#bib.bib10)\] injects trainable low\-rank matrices into otherwise frozen weights, and QLoRA \[[11](https://arxiv.org/html/2606.26618#bib.bib11)\] further cuts memory by quantizing the backbone\. PEFT is standard for adapting large language models and is increasingly applied to speech \[[19](https://arxiv.org/html/2606.26618#bib.bib19)\]\. Its use for closing the quality gap in under\-resourced TTS, and whether a single shared adapter can jointly serve multiple typologically distinct languages, has received limited attention; we address this directly\.

## III Methodology

![Refer to caption](https://arxiv.org/html/2606.26618v1/x1.png)

Figure 1: Proposed shared\-LoRA fine\-tuning pipeline for VoxCPM2\.

### III\-A Model

Figure [1](https://arxiv.org/html/2606.26618#S3.F1) gives an overview of our pipeline\. We build on VoxCPM2 \[[6](https://arxiv.org/html/2606.26618#bib.bib6)\], a tokenizer\-free TTS model with about $2.39\times 10^{9}$ parameters\. The input is a text prompt, which is normalized, segmented, and BPE\-tokenized before entering the model\. VoxCPM2 has two parts\. The first is a MiniCPM\-4 \[[12](https://arxiv.org/html/2606.26618#bib.bib12)\] language\-model backbone \(hidden size 2048, 28 transformer layers plus 8 residual layers, 16 attention heads with 2 key/value heads, vocabulary 73,440\), in which a text\-semantic stage \(TSLM\), a finite scalar quantization stage \(FSQ\), and a residual\-acoustic stage \(RALM\) turn the tokens into an acoustic representation\. The second is a flow\-matching \[[13](https://arxiv.org/html/2606.26618#bib.bib13)\] diffusion transformer \(DiT\) decoder, shown as a local DiT \(LocDiT\) followed by the AudioVAE V2 vocoder, that produces continuous acoustic features \(feature dimension 64, patch size 4\) and renders 48 kHz audio\. Unlike systems that use discrete codec tokens \[[4](https://arxiv.org/html/2606.26618#bib.bib4)\], VoxCPM2 predicts continuous features directly, which avoids codec quantization artifacts\.
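To make the decoder's training signal concrete, the sketch below shows a generic straight-line variant of the flow-matching objective \[13\]. It is not taken from VoxCPM2: the exact path, noise schedule, and conditioning of the real decoder are not specified in this section, and the names `flow_matching_loss`, `oracle`, and `zero` are hypothetical. Only the feature dimension of 64 comes from the text.

```python
import random


def flow_matching_loss(x1, velocity_fn, rng):
    """Single-sample flow-matching loss on a straight-line noise-to-data path.

    x1          -- target feature vector (list of floats)
    velocity_fn -- callable (x_t, t) -> predicted velocity, same length as x1
    rng         -- random.Random instance
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # Gaussian noise endpoint
    t = rng.random()                                         # time in [0, 1)
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]    # point on the path
    target = [b - a for a, b in zip(x0, x1)]                 # path velocity x1 - x0
    pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


rng = random.Random(0)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(64)]   # hypothetical 64-dim feature frame


def oracle(x_t, t):
    # Knows the target, so it can return the exact velocity (x1 - x_t) / (1 - t).
    return [(b - c) / (1.0 - t) for b, c in zip(x1, x_t)]


def zero(x_t, t):
    # Trivial predictor that always outputs zero velocity.
    return [0.0] * len(x_t)


oracle_loss = flow_matching_loss(x1, oracle, random.Random(1))
zero_loss = flow_matching_loss(x1, zero, random.Random(1))
print(oracle_loss, zero_loss)
```

The oracle reaches a loss of essentially zero while the trivial predictor does not, which is the sense in which a lower validation loss means a better fit to the target features.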

### III\-B Corpus

We build one corpus for Khmer \(km\) and Korean \(ko\) from public and in\-house sources, summarized in Table [I](https://arxiv.org/html/2606.26618#S3.T1): a Khmer corpus provided by the Institute of Digital Research & Innovation \(IDRI\), Cambodia; the Korean Single Speaker \(KSS\) corpus \[[20](https://arxiv.org/html/2606.26618#bib.bib20)\]; and Korean Common Voice/FLEURS \[[14](https://arxiv.org/html/2606.26618#bib.bib14),[8](https://arxiv.org/html/2606.26618#bib.bib8)\]\. We prepare the data in four steps\. \(i\) *Aggregation*: we pair each clip with its transcript and measure its duration\. \(ii\) *Cleaning*: we drop clips shorter than 0\.5 s or longer than 20 s and check that audio and text match\. \(iii\) *Tokenization*: we add a language tag \(\[km\] or \[ko\]\) to the front of each transcript and encode it with the VoxCPM2 tokenizer \(vocabulary 73,440\); we drop clips with more than 256 text tokens, leaving 3,717 Khmer and 15,658 Korean clips\. \(iv\) *Manifest construction*: we split each language 90/10 into train and validation, then repeat \(upsample\) the Khmer training clips until Khmer is 40% of the training mix, to make up for its scarcity\. This gives 23,487 training clips \(9,395 Khmer / 14,092 Korean\) and 1,938 validation clips \(372 Khmer / 1,566 Korean\)\. We keep the validation split at its natural ratio, so the loss is measured fairly\.
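The clip counts above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the inclusive duration bounds and the tag format without a separating space are assumptions, and rounding the 10% validation share to the nearest clip is inferred from the reported totals rather than stated in the text.

```python
def keep_clip(duration_s, n_tokens, min_s=0.5, max_s=20.0, max_tokens=256):
    """Cleaning and length filter: duration window plus text-token cap."""
    return min_s <= duration_s <= max_s and n_tokens <= max_tokens


def tag_transcript(text, lang):
    """Prepend the language tag, e.g. '[km]' or '[ko]' (exact format assumed)."""
    return f"[{lang}]{text}"


def split_counts(n_clips, val_fraction=0.10):
    """Return (train, validation) clip counts for a 90/10 split."""
    n_val = round(n_clips * val_fraction)
    return n_clips - n_val, n_val


def upsampled_count(n_other, target_share):
    """Solve k / (k + n_other) = target_share for the upsampled count k."""
    return round(n_other * target_share / (1.0 - target_share))


km_train, km_val = split_counts(3717)        # Khmer clips after filtering
ko_train, ko_val = split_counts(15658)       # Korean clips after filtering
km_upsampled = upsampled_count(ko_train, 0.40)

print(km_train, km_val, ko_train, ko_val)    # per-language split sizes
print(km_upsampled, km_upsampled + ko_train) # Khmer after upsampling, total train
print(km_val + ko_val)                       # validation total at natural ratio
```

Under these assumptions each unique Khmer training clip is repeated about 2.8 times on average, which is the sense in which part of the Khmer training signal comes from repetition and not from additional unique data.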

TABLE I: Composition of the Training Corpus by Language\.

### III\-C Joint Multilingual LoRA Adaptation

Rather than fine\-tuning all parameters, we attach a single shared LoRA \[[10](https://arxiv.org/html/2606.26618#bib.bib10)\] adapter to the frozen backbone\. For a pretrained weight matrix $W_{0}\in\mathbb{R}^{d\times k}$, LoRA constrains the update to a low\-rank product:

$$W = W_{0} + \Delta W = W_{0} + \frac{\alpha}{r}BA, \tag{1}$$

Here $A\in\mathbb{R}^{r\times k}$ uses the Kaiming\-uniform initialization and $B\in\mathbb{R}^{d\times r}$ is set to zero\. Thus $\Delta W=0$ at the start, and training begins exactly at the original \(zero\-shot\) model\. We add the adapter to the query, key, value, and output projections of attention in *both* the language model \(its base and residual layers\) and the DiT decoder\. The feed\-forward linears and the audio VAE stay frozen\. We set $\alpha=2r$ and try ranks 8, 16, 32, 64, and 128\. This is 4\.5 to 72\.4 million trainable parameters, or 0\.19 to 3\.03 percent of the base model\.
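The following minimal sketch illustrates Eq. (1) and the zero-initialization property on a tiny matrix, using plain Python lists in place of a tensor library. The function names are hypothetical, and the bound $1/\sqrt{k}$ is an assumption: it corresponds to Kaiming-uniform initialization with negative slope $\sqrt{5}$, a common default for LoRA's $A$ matrix, whereas the text only says "Kaiming-uniform".

```python
import math
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def init_lora(d, k, r, rng):
    """Return (A, B) with A of shape r x k (random) and B of shape d x r (zeros)."""
    bound = 1.0 / math.sqrt(k)  # assumed Kaiming-uniform bound with fan_in = k
    A = [[rng.uniform(-bound, bound) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B


def effective_weight(W0, A, B, alpha):
    """Eq. (1): W = W0 + (alpha / r) * B A, with W0 kept frozen."""
    r = len(A)
    scale = alpha / r
    BA = matmul(B, A)  # d x k low-rank update
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W0, BA)]


def lora_param_count(d, k, r):
    """Trainable parameters added to one d x k weight matrix."""
    return r * (d + k)


rng = random.Random(0)
d, k, r = 4, 3, 2
W0 = [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(d)]
A, B = init_lora(d, k, r, rng)

# With B = 0 the adapted weight equals the pretrained weight exactly.
print(effective_weight(W0, A, B, alpha=2 * r) == W0)
print(lora_param_count(d, k, r))
```

Because the per-matrix count $r(d+k)$ is linear in $r$, the trainable fraction grows in proportion to the rank, which is consistent with the 16-fold spread between the rank-8 and rank-128 parameter counts quoted above.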

The central design choice is to train *one* adapter on Khmer and Korean *together* from the language\-tagged data\. A single set of low\-rank matrices therefore learns both scripts\. The language tag tells the model which language it is reading, so the adapter can share capacity while still keeping the scripts apart\. No separate model or adapter is needed for each language\.

### III\-D Training Configuration

We train every adapter with the AdamW optimizer \($\beta_{1}=0.9$, $\beta_{2}=0.999$, weight decay 0\.01\)\. The peak learning rate is $1\times 10^{-4}$, with a 200\-step linear warmup and then cosine decay to zero\. The effective batch size is 16 \(micro\-batch 4, gradient accumulation 4\), we clip gradients at 1\.0, and we use mixed\-precision \(bfloat16\) training; the audio VAE stays in float32\. Each run is 10,000 steps, validated every 500 steps\. We use one NVIDIA H200 GPU, and each rank takes about 2\.6 hours \(around 1\.07 steps per second\)\.
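The learning-rate schedule and the run-time figures can be written down compactly. This is a sketch under two assumptions that the text does not state: the warmup starts from zero, and the cosine decay spans all steps remaining after warmup. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, peak=1e-4, warmup=200, total=10_000):
    """Linear warmup to `peak`, then cosine decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


effective_batch = 4 * 4                 # micro-batch x gradient accumulation
hours_per_run = 10_000 / 1.07 / 3600    # steps / (steps per second) / (s per hour)

for step in (0, 100, 200, 5_100, 10_000):
    print(step, learning_rate(step))
print(effective_batch, round(hours_per_run, 2))
```

At 1.07 steps per second, 10,000 steps take roughly 2.6 hours, which agrees with the reported wall-clock time per rank.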

### III\-E Evaluation Metrics

Our main automatic metric is the *validation flow\-matching loss* \(loss\_diff\), the diffusion objective measured on the held\-out validation split\. A lower value means a better fit to the target speech\. We also track the stop\-token loss \(loss\_stop\)\. Because the adapter starts at zero, the loss at the start of training equals that of the zero\-shot base model, so the drop during training shows how much of the gap the adapter closes\. We also synthesize speech: for each rank and for the base, we generate the same Khmer and Korean sentences at 48 kHz\. Finally, we run MOS listening tests for both languages \(Tables [III](https://arxiv.org/html/2606.26618#S4.T3) and [IV](https://arxiv.org/html/2606.26618#S4.T4)\)\. For each language, five native speakers \(male and female\) rate every system on a 5\-point scale along three axes, naturalness, prosody, and pronunciation, over 20 sentences\. Let $r_{s,a,i}$ be the score that rater $i$ gives system $s$ on axis $a\in\{\mathrm{nat},\mathrm{pros},\mathrm{pron}\}$, and let $\bar{m}_{s,a}=\frac{1}{N}\sum_{i}r_{s,a,i}$ be its mean\. The overall MOS is the mean of the three axes:

$$\mathrm{MOS}_{s}=\frac{1}{3}\sum_{a}\bar{m}_{s,a}. \tag{2}$$

We compare each system with the zero\-shot base using a paired Wilcoxon signed\-rank test\. We also give 95% confidence intervals from a bootstrap \(10,000 resamples\) and report inter\-rater agreement with Krippendorff’s $\alpha$\. The two languages used different rater panels, so we compare systems only *within* a language, never across languages\.
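As an illustration of Eq. (2) and the bootstrap interval, the sketch below computes an overall MOS and a percentile confidence interval. The ratings are invented for the example and are not the study's data. Resampling individual ratings with replacement and using a percentile interval are assumptions, since the text does not say which unit (rating, sentence, or rater) is resampled; the function names are hypothetical.

```python
import random

AXES = ("nat", "pros", "pron")


def avg(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def overall_mos(ratings):
    """Eq. (2): mean over the three axes of the per-axis mean score.

    ratings -- dict mapping axis name to a list of scores for one system
    """
    return avg([avg(ratings[axis]) for axis in AXES])


def bootstrap_ci(scores, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(avg(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((1.0 - level) / 2.0 * n_resamples)]
    hi = means[int((1.0 + level) / 2.0 * n_resamples) - 1]
    return lo, hi


# Hypothetical ratings for one system (5-point scale), for illustration only.
ratings = {
    "nat": [4, 4, 5, 3, 4],
    "pros": [5, 4, 4, 4, 3],
    "pron": [4, 3, 4, 4, 4],
}
all_scores = [score for axis in AXES for score in ratings[axis]]

print(overall_mos(ratings))
print(bootstrap_ci(all_scores))
```

With equal numbers of ratings per axis, the overall MOS equals the mean of all pooled ratings, so the interval over `all_scores` is an interval for the same quantity.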

## IV Results and Discussion

### IV\-A Rank Sweep and Perceptual Results

Table [II](https://arxiv.org/html/2606.26618#S4.T2) shows the trainable fraction, adapter size, and validation loss at 10,000 steps for the base and each rank\. Tables [III](https://arxiv.org/html/2606.26618#S4.T3) and [IV](https://arxiv.org/html/2606.26618#S4.T4) show the native\-speaker MOS, by axis, for Korean and Khmer\.

*Automatic loss\.* The validation flow\-matching loss drops from about 0\.83 \(zero\-shot\) to between 0\.71 and 0\.73 for every rank\. The loss is lowest at rank 128 \(0\.7094\), while rank 8 has the second\-lowest loss \(0\.7243\) but uses a 15 times smaller \(18 MB\) adapter\.
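The relation between rank, trainable fraction, and adapter size can be checked with simple arithmetic. The sketch below assumes that the trainable parameter count is exactly linear in rank (true for LoRA on a fixed set of matrices) and that the adapter is stored at 4 bytes per parameter; the storage precision is an assumption. Values for ranks 16, 32, and 64 are therefore estimates, not figures reported in the paper.

```python
BASE_PARAMS = 2.39e9          # base model size quoted in the text
PARAMS_AT_RANK_128 = 72.4e6   # trainable parameters quoted for rank 128
BYTES_PER_PARAM = 4           # assumption: adapter weights stored in float32


def adapter_params(rank):
    """Estimated trainable parameters, assuming the count is linear in rank."""
    return PARAMS_AT_RANK_128 * rank / 128


def trainable_percent(rank):
    """Trainable parameters as a percentage of the frozen base model."""
    return 100.0 * adapter_params(rank) / BASE_PARAMS


def adapter_megabytes(rank):
    """Estimated adapter size on disk in MB (10^6 bytes)."""
    return adapter_params(rank) * BYTES_PER_PARAM / 1e6


for rank in (8, 16, 32, 64, 128):
    print(rank,
          round(adapter_params(rank) / 1e6, 2),
          round(trainable_percent(rank), 2),
          round(adapter_megabytes(rank), 1))
```

Under these assumptions the rank-8 estimate comes out at about 4.5 million parameters, 0.19 percent of the base model, and roughly 18 MB, in line with the figures quoted in the text.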

*Khmer quality\.* LoRA gives a large gain that grows with rank and then peaks\. Overall MOS rises from 3\.847 \(zero\-shot\) to 4\.227 at rank 64, a gain of 0\.38 points that is highly significant \($p<0.001$\)\. Ranks 32 and 128 are also significant \($p=0.001$\), while ranks 8 and 16 are not \($p=0.19$ and $0.07$\)\. The gain shows up on all three axes, and the biggest single jump is in prosody, which rises from 3\.76 to 4\.36, the strongest effect in the study\. This suggests the base model’s main Khmer weakness was rhythm and intonation\. Quality rises up to rank 64 and then falls at rank 128 \(4\.083\), so capacity beyond rank 64 hurts perceived quality even though the validation loss is lowest there\. Loss and listener ratings thus agree that adaptation helps while disagreeing on how much capacity is worthwhile: rank 64 wins perceptually, whereas rank 8 \(3\.910\) captures most of the loss reduction but only about 0\.06 of the 0\.38\-point MOS gain\.

*Korean quality\.* The result is very different\. No adapter significantly improves overall Korean MOS: the best mean, rank 32 \(3\.757 vs\. 3\.650 for the base\), is not significant \($p=0.49$\), and the confidence intervals for ranks 8 to 32 all overlap the baseline\. The only significant overall change is *negative*: rank 64 \(3\.480\) is significantly below the base \($p=0.02$\), as naturalness falls from 3\.67 to 3\.23\. The only local improvement is pronunciation, which reaches 4\.03 at rank 32 while naturalness and prosody stay flat\. Korean is already well covered by the base model’s pretraining, so there is little gap to close\. At a high rank the extra capacity appears to overfit the small training set and damage skills the model already had\. The stop\-token loss converges below 0\.05 for all ranks\.

TABLE II: Effect of LoRA Rank on Adapter Size and Validation Flow\-Matching Loss\. Base: $2.39\times 10^{9}$ params, all frozen except the adapter; $\alpha=2r$\.

TABLE III: Native\-Speaker Mean Opinion Scores for Korean\. Asterisks denote a statistically significant difference from the zero\-shot base \(paired Wilcoxon signed\-rank test\): $^{*}p<0.05$, where rank 64 is a significant *decrease*\. Unmarked rows are not significant\. The highest mean is shown in boldface\.

TABLE IV: Native\-Speaker Mean Opinion Scores for Khmer\. Asterisks denote a statistically significant difference from the zero\-shot base \(paired Wilcoxon signed\-rank test\): $^{***}p<0.001$, $^{**}p<0.01$\. Unmarked rows are not significant\. The best result is shown in boldface\.
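The significance markers in Tables III and IV come from a paired Wilcoxon signed-rank test. A minimal sketch of that test is given below, using exact enumeration of the null distribution, which is only practical for small samples. This is an illustrative implementation, not the authors' code: the paper does not state which implementation or approximation it used, the function name is hypothetical, and the example scores are invented.

```python
from itertools import product


def signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped and tied absolute
    differences receive average ranks. Enumerates 2**n sign patterns, so it is
    meant for small n only (roughly n <= 20).
    """
    d = [a - b for a, b in zip(x, y) if a != b]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1            # mean of 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null hypothesis every rank carries a + or - sign with equal odds.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical paired scores (adapted vs. base) on five items.
adapted = [4, 5, 4, 5, 5]
base = [3, 3, 1, 1, 0]
print(signed_rank_exact(adapted, base))
```

With only five pairs, even a result in which every difference favours one system cannot reach $p<0.05$ in the two-sided exact test, which shows how strongly significance depends on the number of paired observations.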

### IV\-B Qualitative Synthesis

For each rank and for the base, we synthesize the same Khmer and Korean sentences at 48 kHz\. By ear, the adapted models pronounce Khmer subscript consonant clusters and Korean sound boundaries more reliably than the base, and their rhythm is steadier\. This matches the lower validation loss\. These samples come with the released artifacts\.

### IV\-C Discussion

Two interpretations follow\. First, the validation loss is an imperfect guide to perceived quality: it is lowest at rank 128, yet Khmer MOS peaks at rank 64, and the tiny rank\-8 adapter, which already captures most of the loss reduction, recovers only a small part of the MOS gain \(MOS 3\.910\)\. The rank is thus better chosen by listening than from the training curve\. Second, and more important, the payoff depends on the language\. We read the large Khmer gain alongside the Korean null result, with degradation at high rank, as evidence that the useful adapter size scales with the distance between a language and what the base model already knows: a large distance \(Khmer\) absorbs more capacity, up to rank 64, whereas a small one \(Korean\) gains little and overfits\. Adaptation should therefore target languages where the base model is genuinely weak, and no single global rank suits every language\.
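How much of each improvement the rank-8 adapter captures can be computed directly from figures quoted in the text. The helper name `fraction_recovered` is hypothetical, and the zero-shot loss is only given as "about 0.83", so the loss fraction is approximate.

```python
def fraction_recovered(base, small, best):
    """Share of the base-to-best change that the small adapter already achieves."""
    return (small - base) / (best - base)


# Khmer overall MOS: zero-shot base, rank 8, rank 64 (best MOS).
mos_share = fraction_recovered(3.847, 3.910, 4.227)

# Validation flow-matching loss: zero-shot (approximate), rank 8, rank 128 (lowest).
loss_share = fraction_recovered(0.83, 0.7243, 0.7094)

print(round(mos_share, 3), round(loss_share, 3))
```

On these numbers rank 8 achieves a large share of the loss reduction but a much smaller share of the Khmer MOS gain, which illustrates the mismatch between the automatic loss and the listening tests.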

### IV\-D Limitations

Our study has several limitations\. We report MOS for both Khmer and Korean \(Tables [III](https://arxiv.org/html/2606.26618#S4.T3) and [IV](https://arxiv.org/html/2606.26618#S4.T4)\), but a full fine\-tuning upper bound is left for future work\. We also have not yet isolated the value of *sharing*: to show that one joint adapter beats separate per\-language adapters, we would need a comparison at the same parameter budget, which we leave for future work\. The ratings are also noisy\. Inter\-rater agreement is low \(Krippendorff’s $\alpha=0.31$ for Khmer and $0.26$ for Korean\), so the Korean null result is especially uncertain\. Each panel has only five raters and 20 sentences per system, and the two languages used different panels, so only within\-language trends are valid, not absolute cross\-language scores\. The corpus is small \(about 26 hours, with single\-speaker Korean from KSS\), and the Khmer gains come partly from upsampling, not from more unique data\. Finally, the two languages differ not only in pretraining coverage but also in fine\-tuning *data source*: Korean uses open public corpora \(KSS and Common Voice/FLEURS\), while Khmer uses a private in\-house corpus\. The two sets may differ in recording quality, speaker variety, and amount\. The Khmer\-versus\-Korean difference therefore conflates two factors: \(i\) how far each language is from what the base model knows, and \(ii\) the fine\-tuning data itself\. We do not fully separate these\. This is another reason our claims hold within a language, not as a controlled comparison across languages\.

## V Conclusion

We studied the quality gap that large pretrained TTS models show for low\-resource languages, using Khmer and Korean with VoxCPM2\. We built one language\-tagged corpus from several sources and trained a single shared LoRA adapter on both languages, adding low\-rank updates to both the language model and the decoder\. This lowered the validation loss from about 0\.83 \(zero\-shot\) to as low as 0\.709 and raised Khmer MOS from 3\.85 to 4\.23 \(an improvement of 0\.38 points, $p<0.001$\), while training only 0\.19 to 3\.03 percent of the parameters\. The rank sweep shows that the loss is lowest at rank 128, but Khmer MOS is highest at rank 64, while rank 8 captures most of the loss reduction but only a small part of the MOS gain\. For Korean, which the base model already covers well, no adapter improves overall quality and a high rank makes it worse\. The contrast between the two languages indicates that small LoRA adapters help most where the base model starts out weakest\. Future work will add a full fine\-tuning upper bound and per\-language rank selection, and strengthen evaluation with larger rater panels, objective metrics, and more low\-resource languages\.

## References

- \[1\] J\. Shen *et al\.*, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in *Proc\. IEEE ICASSP*, 2018, pp\. 4779–4783\.
- \[2\] Y\. Ren, C\. Hu, X\. Tan, T\. Qin, S\. Zhao, Z\. Zhao, and T\.\-Y\. Liu, “FastSpeech 2: Fast and high\-quality end\-to\-end text to speech,” in *Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2021\.
- \[3\] J\. Kim, J\. Kong, and J\. Son, “Conditional variational autoencoder with adversarial learning for end\-to\-end text\-to\-speech,” in *Proc\. Int\. Conf\. Mach\. Learn\. \(ICML\)*, 2021, pp\. 5530–5540\.
- \[4\] C\. Wang *et al\.*, “Neural codec language models are zero\-shot text to speech synthesizers,” 2023, arXiv:2301\.02111\. \[Online\]\. Available: https://arxiv\.org/abs/2301\.02111
- \[5\] Y\. Zhou *et al\.*, “VoxCPM: Tokenizer\-free TTS for context\-aware speech generation and true\-to\-life voice cloning,” 2025, arXiv:2509\.24650\. \[Online\]\. Available: https://arxiv\.org/abs/2509\.24650
- \[6\] Y\. Zhou *et al\.*, “VoxCPM2 technical report,” 2026, arXiv:2606\.06928\. \[Online\]\. Available: https://arxiv\.org/abs/2606\.06928
- \[7\] V\. Pratap *et al\.*, “Scaling speech technology to 1,000\+ languages,” 2023, arXiv:2305\.13516\. \[Online\]\. Available: https://arxiv\.org/abs/2305\.13516
- \[8\] A\. Conneau *et al\.*, “FLEURS: Few\-shot learning evaluation of universal representations of speech,” in *Proc\. IEEE Spoken Lang\. Technol\. Workshop \(SLT\)*, 2022, pp\. 798–805\.
- \[9\] R\. Buoy, N\. Taing, and S\. Kor, “Joint Khmer word segmentation and part\-of\-speech tagging using deep learning,” 2021, arXiv:2103\.16801\. \[Online\]\. Available: https://arxiv\.org/abs/2103\.16801
- \[10\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen, “LoRA: Low\-rank adaptation of large language models,” in *Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2022\.
- \[11\] T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in *Adv\. Neural Inf\. Process\. Syst\. \(NeurIPS\)*, 2023\.
- \[12\] S\. Hu *et al\.*, “MiniCPM: Unveiling the potential of small language models with scalable training strategies,” 2024, arXiv:2404\.06395\. \[Online\]\. Available: https://arxiv\.org/abs/2404\.06395
- \[13\] Y\. Lipman, R\. T\. Q\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le, “Flow matching for generative modeling,” in *Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2023\.
- \[14\] R\. Ardila *et al\.*, “Common Voice: A massively\-multilingual speech corpus,” in *Proc\. Lang\. Resour\. Eval\. Conf\. \(LREC\)*, 2020, pp\. 4218–4222\.
- \[15\] E\. Casanova, J\. Weber, C\. D\. Shulby, A\. C\. Junior, E\. Gölge, and M\. A\. Ponti, “YourTTS: Towards zero\-shot multi\-speaker TTS and zero\-shot voice conversion for everyone,” in *Proc\. Int\. Conf\. Mach\. Learn\. \(ICML\)*, 2022, pp\. 2709–2720\.
- \[16\] E\. Casanova *et al\.*, “XTTS: A massively multilingual zero\-shot text\-to\-speech model,” 2024, arXiv:2406\.04904\. \[Online\]\. Available: https://arxiv\.org/abs/2406\.04904
- \[17\] M\. Chen, X\. Tan, B\. Li, Y\. Liu, T\. Qin, S\. Zhao, and T\.\-Y\. Liu, “AdaSpeech: Adaptive text to speech for custom voice,” in *Proc\. Int\. Conf\. Learn\. Represent\. \(ICLR\)*, 2021\.
- \[18\] S\. Ö\. Arık, J\. Chen, K\. Peng, W\. Ping, and Y\. Zhou, “Neural voice cloning with a few samples,” in *Adv\. Neural Inf\. Process\. Syst\. \(NeurIPS\)*, 2018, pp\. 10019–10029\.
- \[19\] Z\.\-C\. Chen, C\.\-L\. Fu, C\.\-Y\. Liu, S\.\-W\. Li, and H\.\-y\. Lee, “Exploring efficient\-tuning methods in self\-supervised speech models,” in *Proc\. IEEE Spoken Lang\. Technol\. Workshop \(SLT\)*, 2022, arXiv:2210\.06175\.
- \[20\] K\. Park, “KSS dataset: Korean single speaker speech dataset,” 2018\. \[Online\]\. Available: https://www\.kaggle\.com/datasets/bryanpark/korean\-single\-speaker\-speech\-dataset