UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

arXiv cs.CL 06/11/26, 04:00 AM Papers
Summary
UR-BERT proposes a Romanized transcription-based text encoder for massively multilingual TTS, scaling to 495 languages by using universal Romanization and a speech token prediction objective to enhance phonetic alignment and generalization to unseen languages.
arXiv:2606.11681v1 Announce Type: new Abstract: We propose UR-BERT, a Romanized transcription-based text-to-speech (TTS) encoder for massively multilingual TTS systems. Conventional grapheme-to-phoneme (G2P)-based approaches are limited to around 100 languages due to the availability of reliable G2P resources. In contrast, UR-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation. To further enhance phonetic fidelity and text-speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech-aware phonetic representations in a data-efficient manner. Experiments show that TTS systems built on UR-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages.
Cached at: 06/11/26, 01:40 PM
# Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction
Source: [https://arxiv.org/html/2606.11681](https://arxiv.org/html/2606.11681)
Lee Ahn Choi Kang

###### Abstract

We propose UR\-BERT, a Romanized transcription\-based text\-to\-speech \(TTS\) encoder for massively multilingual TTS systems\. Conventional grapheme\-to\-phoneme \(G2P\)\-based approaches are limited to around 100 languages due to the availability of reliable G2P resources\. In contrast, UR\-BERT scales to 495 languages by unifying diverse writing systems into a shared Romanization representation\. To further enhance phonetic fidelity and text–speech alignment, we introduce a speech token prediction objective during training, which encourages the encoder to learn speech\-aware phonetic representations in a data\-efficient manner\. Experiments show that TTS systems built on UR\-BERT consistently outperform recent text encoder baselines across a wide range of languages and resource conditions, and demonstrate strong generalization to unseen languages\.

###### keywords:

self\-supervised learning, text\-to\-speech, romanization, multilingualism

### 1 Introduction

Neural text\-to\-speech \(TTS\) systems have achieved substantial progress across languages and speaking styles\. Most recent approaches adopt encoder–decoder architectures, in which the encoder produces linguistic representations that are transformed into acoustic features or speech waveforms by a decoder\. While decoder models have advanced rapidly with the introduction of flow matching and neural codec language modeling \[tacotron,fastspeech2,glowtts,vits,gradtts,difftts,matchatts,f5tts,valle,speartts\], encoder design has received comparatively less attention\. In particular, prior work has primarily focused on phonetic adequacy for reliable text–speech alignment, a recurring challenge in TTS\.

Meanwhile, advances in self\-supervised learning have demonstrated strong empirical performance across diverse domains \[bert,albert,roberta,wav2vec2,hubert,wavlm\], fostering growing interest in the pretraining of text encoders for TTS\. These models capture rich contextual and semantic information beyond purely phonetic cues, leading recent TTS systems to incorporate BERT\-style representations to enhance naturalness\. Prior studies \[ebert1,ebert2,ebert3,ebert4\] incorporated BERT embeddings as auxiliary inputs to augment phonetic representations\. However, this approach exposes a structural mismatch between TTS text encoders and general\-purpose language models\. Specifically, TTS systems typically operate at the character or phoneme level, whereas BERT relies on subword units, creating a granularity discrepancy that complicates precise alignment and representation integration\.

To mitigate this mismatch, subsequent work proposed BERT\-style text encoders pretrained from scratch to better align linguistic representations with TTS requirements\. Early approaches \[pngbert,mpbert\] introduced phoneme\-aware pretraining as a core design principle by jointly modeling grapheme and phoneme units to bridge textual and phonetic spaces\. Building on this paradigm, a later extension \[plbert\] streamlined the framework by incorporating both grapheme and phoneme information only during pretraining while restricting downstream usage to phoneme inputs\. More recently, studies \[styletts2,xphonebert\] have extended phoneme\-level pretraining to multilingual settings, demonstrating that phoneme\-based language modeling remains effective even when trained on multilingual corpora\.

Despite their effectiveness, these models inherently rely on G2P toolkits \[phonemizer,charsiug2p\] to generate phoneme sequences, creating a systemic dependency that significantly constrains scalability\. This reliance poses a major obstacle to achieving truly global coverage, as G2P systems are typically available for only around 100 languages, leaving the vast majority of the world’s languages unsupported\. In addition, encoders pretrained solely on textual corpora lack exposure to acoustic contexts, preventing them from capturing fine\-grained prosodic and speech\-related cues that are critical for high\-quality TTS synthesis\.

To address these challenges, we propose UR\-BERT \(official implementation: https://github\.com/sanghyang00/ur\-bert\), a speech\-aware pretrained text encoder for massively multilingual TTS covering 495 languages\. We adopt Romanization as a language\-agnostic textual interface in place of language\-specific G2P systems, enabling scalable coverage beyond the limitations of existing G2P pipelines\. To further enhance phonetic modeling, we introduce a knowledge distillation objective based on speech token prediction\. Specifically, a multilingual speech self\-supervised model \(S3M\) serves as the teacher, and UR\-BERT is trained to predict its output tokens, aligning textual representations with rich acoustic latent spaces\. This alignment mitigates the phonetic abstraction introduced by Romanization and narrows the text–speech modality gap, achieving both scalability across languages and high phonetic fidelity\.

In experiments, UR\-BERT consistently outperforms prior BERT\-style TTS encoders across a broad range of languages and evaluation metrics in both high\- and low\-resource settings, while supporting substantially more languages without compromising synthesis quality\. Furthermore, the model maintains strong performance even with reduced amounts of pretraining data, underscoring the effectiveness of integrating Romanization with the proposed speech\-aware pretraining strategy\.

![Refer to caption](https://arxiv.org/html/2606.11681v1/figures/model.png)

Figure 1: Overview of UR\-BERT, showing the pretraining and finetuning stages\.

Our contributions are summarized as follows:

- We propose UR\-BERT, a multilingual text encoder for TTS, pretrained on speech–text pairs covering 495 languages\.
- We overcome the language coverage limitations of existing G2P pipelines by adopting Romanization as a unified orthographic interface for massively multilingual TTS\.
- We introduce a novel speech token prediction\-based pretraining strategy that aligns BERT\-style text representations with acoustic information, enabling high\-quality TTS synthesis\.

### 2 Related Work

To extend monolingual text embeddings to multilingual TTS encoders, recent work has adopted BERT\-style pretraining for text representations\. An early effort in this direction is multilingual PLBERT \(m\-PLBERT\), introduced in the StyleTTS2 \[styletts2\] framework \(https://huggingface\.co/papercup\-ai/multilingual\-pl\-bert\)\. Following the original PL\-BERT \[plbert\] design, m\-PLBERT pretrains the text encoder on phoneme sequences from 15 languages, generated using Phonemizer \[phonemizer\]\. However, its language coverage is limited to relatively high\-resource languages, including English, Chinese, and several European languages\. Subsequently, XPhoneBERT \[xphonebert\] extended this paradigm by pretraining on phoneme sequences from 88 languages using CharsiuG2P \[charsiug2p\]\. While it substantially increases language coverage, the pretraining data remain concentrated in European and Asian languages, with limited representation of many African and Indigenous American languages\.

Scaling these approaches to a truly massive number of languages remains challenging due to their strong reliance on G2P systems\. High\-quality rule\-based G2P modules are scarce, and even existing toolkits cover only a small fraction of the world’s languages\. Moreover, zero\-shot neural G2P alternatives often exhibit unstable performance, further limiting their applicability to previously unseen languages\.

### 3 Proposed Method

#### 3\.1 Architecture Overview

The key distinctions of the proposed UR\-BERT lie in its language scalability and training objectives\. UR\-BERT adopts Romanization as a unified text representation, enabling scalable modeling across diverse writing systems without reliance on G2P systems\. It is pretrained on speech–text paired data spanning 495 languages, using a standard BERT\-base architecture \[bert\] with a character\-level tokenizer and 12 Transformer encoder layers \[transformer\]\. In addition to the conventional masked language modeling \(MLM\) objective, UR\-BERT incorporates speech token prediction \(STP\) as an auxiliary objective, injecting text\-conditioned acoustic information during pretraining\. Figure [1](https://arxiv.org/html/2606.11681#S1.F1) illustrates the pretraining and fine\-tuning pipeline of UR\-BERT, with detailed design choices described in the following subsections\.

#### 3\.2 Romanization for Language Scalability

We adopt Romanization to unify diverse orthographic systems into the Latin alphabet due to its superior scalability and token efficiency compared to phoneme\-based approaches\. Phoneme\-based methods rely on G2P systems, which require substantial linguistic expertise to design fine\-grained, language\-specific rules, thereby limiting practical coverage\. As a result, existing G2P toolkits support only around 100 languages, such as 88 languages in CharsiuG2P \[charsiug2p\] and 127 languages in Phonemizer \[phonemizer\]\. In contrast, Romanization enables theoretically unbounded scalability by transliterating diverse writing systems into a shared Latin script, as exemplified by the Uroman toolkit \[uroman\]\. This advantage has been shown to scale to thousands of languages across multiple tasks, including TTS \[xtts\] and automatic speech recognition \(ASR\) \[mms,lamaut\]\.

Conventional G2P systems convert graphemes into phonetic representations using the International Phonetic Alphabet \(IPA\) \[ipa\]\. While IPA representations provide fine\-grained phonetic detail, they require a large and diverse symbol inventory, often spanning thousands of symbols, which substantially increases vocabulary size and complicates tokenization\. For example, some tokenization schemes treat prosodic markers, such as suprasegmentals and diacritics, as independent tokens despite their lack of standalone phonetic meaning, whereas others merge them with neighboring vowel or consonant tokens, leading to inconsistent token granularity\. In contrast, Romanization transliterates non\-Latin scripts into Latin characters, limiting the token inventory to approximately 30 alphabetic symbols and avoiding explicit prosodic markers\. This compact token space simplifies tokenization and promotes more stable training\. Moreover, prior work has shown that Romanization retains sufficient phonetic information for a wide range of speech\-related tasks \[xtts,mms,lamaut\], despite the reduced vocabulary size\.
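
To make the compact token space concrete, the following is a minimal sketch of a character\-level tokenizer over text that has already been Romanized\. The vocabulary \(26 lowercase letters, a space, an apostrophe, and three special tokens\) and the names `build_vocab` and `encode` are illustrative assumptions, not the tokenizer released with UR\-BERT\. Romanization itself \(for example with Uroman\) is assumed to happen upstream\.

```python
import string

# Hypothetical special tokens; the actual UR-BERT vocabulary may differ.
SPECIALS = ["[PAD]", "[UNK]", "[MASK]"]


def build_vocab():
    """Map special tokens and a small Latin character set to integer ids."""
    symbols = SPECIALS + list(string.ascii_lowercase) + [" ", "'"]
    return {sym: idx for idx, sym in enumerate(symbols)}


def encode(romanized_text, vocab):
    """Convert Romanized text to one token id per character."""
    unk = vocab["[UNK]"]
    return [vocab.get(ch, unk) for ch in romanized_text.lower()]


vocab = build_vocab()
# Example input: an already-Romanized string (assumed output of a romanizer).
ids = encode("annyeonghaseyo", vocab)
```

Because every input character maps to exactly one id, the token sequence has the same length as the Romanized string, which is what makes per\-character supervision straightforward\.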

#### 3\.3 Speech Token Injection for Phonetic Fidelity

Despite the advantages of Romanization, capturing fine\-grained phonetic distinctions remains challenging due to the limited token inventory compared to IPA, particularly when identical Romanized representations correspond to different pronunciations across languages\. To mitigate this acoustic ambiguity, we introduce a knowledge distillation objective that injects acoustic token information from a pretrained multilingual speech self\-supervised model \(S3M\) \[xlsr53,xlsr,mms,xeus,omnilingualasr\] into UR\-BERT during pretraining\.

Unlike conventional TTS systems that require clean, curated speech data, our approach leverages large\-scale ASR speech–text pairs by injecting speech\-derived supervision into the text encoder through three steps: \(1\) extracting speech representations from S3M, \(2\) aligning them to character\-level text using forced alignment, and \(3\) discretizing the aligned representations into speech tokens that serve as auxiliary training targets\. Through this process, ASR corpora are reframed as a scalable source of phonetic guidance, enabling TTS models to benefit from data previously unsuitable for speech synthesis\.

Speech Representation Extraction\. We employ the omnilingual\-ASR\-W2V\-300M model \(https://huggingface\.co/facebook/omniASR\-W2V\-300M\) as the teacher network and extract representations from its 16th layer\. This design choice is motivated by prior findings that intermediate layers of multilingual S3Ms predominantly encode phonetic\-level information rather than high\-level semantic representations \[layerwise1,layerwise2\]\.

![Refer to caption](https://arxiv.org/html/2606.11681v1/figures/fa.png)

Figure 2: Illustration of the CTC\-based speech\-text alignment\.

CTC\-Based Speech\-Text Alignment\. A key challenge in speech\-text alignment arises from the mismatch in sequence length between speech and text, as acoustic feature sequences are typically much longer than their textual counterparts\. To obtain character\-level acoustic representations, we apply CTC\-based forced alignment using MMS\-FA \[mms\], followed by average pooling over the aligned frames for each character\. The overall alignment procedure is illustrated in Figure [2](https://arxiv.org/html/2606.11681#S3.F2)\.
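
The pooling step can be sketched as follows, assuming the forced aligner has already produced one frame span per character\. The function name `pool_frames_per_character`, the `(start, end)` span format, and the use of `None` for characters with no aligned frames are assumptions made for illustration; they are not taken from the paper or from MMS\-FA\.

```python
def pool_frames_per_character(frames, spans):
    """Average frame-level vectors inside each character's aligned span.

    frames: list of equal-length float vectors, one per speech frame.
    spans:  list of (start, end) frame indices per character, end exclusive.
            An empty span (start == end) yields None (no aligned audio).
    """
    pooled = []
    for start, end in spans:
        segment = frames[start:end]
        if not segment:
            pooled.append(None)
            continue
        dim = len(segment[0])
        # Mean over the frames of this span, computed per feature dimension.
        pooled.append(
            [sum(vec[d] for vec in segment) / len(segment) for d in range(dim)]
        )
    return pooled


# Toy example: 5 frames of 2-dimensional features aligned to 3 characters.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
spans = [(0, 2), (2, 5), (5, 5)]
char_vectors = pool_frames_per_character(frames, spans)
```

In this toy case the first character averages two frames, the second averages three, and the third has no aligned frames\.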

Discrete Token Assignment\. To discretize continuous character\-level acoustic representations, we perform k\-means clustering over the pretraining corpus to construct a finite codebook\. Each character\-level acoustic representation is assigned to its nearest cluster centroid, yielding a discrete speech token for every Romanized character\. These tokens are used as supervision for the STP objective during pretraining, enabling UR\-BERT to infer acoustic information directly from text input\. We set the codebook size to 257, where index 0 represents a mute token, and indices 1\-256 correspond to acoustic tokens\. Larger codebooks were not considered, as excessive capacity tends to encode speaker\-dependent or paralinguistic variations rather than phonetic content \[selm,diffkmeans\], and may destabilize training due to the mismatch with the compact text vocabulary\. This design choice is further motivated by phonological theory, which models speech sound inventories as economical combinations of a limited number of binary features \[tokenlimit\], striking a balance between representational capacity and phonetic abstraction\.
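
A minimal sketch of the nearest\-centroid assignment is given below\. It assumes the centroids have already been fitted by k\-means, and it assumes that characters without an acoustic vector \(represented here as `None`\) receive the mute token; the paper reserves index 0 for a mute token but does not spell out when it is emitted\. The names `assign_speech_tokens` and `squared_distance` are hypothetical\.

```python
MUTE_TOKEN = 0  # index 0 is reserved for the mute token, as stated in the text


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_speech_tokens(char_vectors, centroids):
    """Return one discrete speech token per character.

    A character with a vector gets 1 + index of its nearest centroid, so
    acoustic tokens occupy indices 1..len(centroids). A character without a
    vector (None) gets MUTE_TOKEN; this handling is an assumption.
    """
    tokens = []
    for vec in char_vectors:
        if vec is None:
            tokens.append(MUTE_TOKEN)
            continue
        nearest = min(
            range(len(centroids)),
            key=lambda k: squared_distance(vec, centroids[k]),
        )
        tokens.append(nearest + 1)
    return tokens


# Toy codebook with 3 centroids (the paper uses 256 acoustic centroids).
centroids = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]]
tokens = assign_speech_tokens(
    [[1.9, 0.1], [0.1, 3.5], None, [0.2, 0.1]], centroids
)
```

With 256 centroids, the same offset yields token ids in the range 1\-256, matching the codebook layout described above\.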

### 4 Experiments

#### 4\.1 Pretraining

We construct the pretraining corpus by combining three ASR datasets: FLEURS \[fleurs\], which spans 102 read\-speech languages; Common Voice \[commonvoice\], a crowdsourced dataset covering 131 languages; and the Omnilingual ASR corpus \[omnilingualasr\], which includes 348 low\-resource languages\. The resulting pretraining corpus contains approximately 13K hours of speech across 495 languages, comprising 8M sentences as summarized in Table [1](https://arxiv.org/html/2606.11681#S4.T1)\. Pretraining is conducted for 150K steps with a batch size of 1024 using gradient accumulation\. We employ the AdamW \[adamw\] optimizer with a tri\-stage learning rate schedule \[data2vec,wav2vec2,wavlm\], using warm\-up, peak, and decay ratios of 0\.1, 0\.5, and 0\.4 with a peak learning rate of 1e\-4\.
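
The tri\-stage schedule can be illustrated with a short sketch\. The stage ratios, peak learning rate, and step count come from the text; the linear warm\-up shape, the exponential decay shape, and the `init_scale` and `final_scale` values are assumptions, since the paper does not state them\. The function name `tri_stage_lr` is hypothetical\.

```python
def tri_stage_lr(step, total_steps, peak_lr=1e-4,
                 ratios=(0.1, 0.5, 0.4), init_scale=0.01, final_scale=0.01):
    """Learning rate at `step` for a warm-up / hold / decay schedule.

    init_scale and final_scale are fractions of peak_lr used at the very
    start and very end of training; both values are assumptions.
    """
    warmup_steps = round(ratios[0] * total_steps)
    hold_steps = round(ratios[1] * total_steps)
    decay_steps = max(1, total_steps - warmup_steps - hold_steps)

    if step < warmup_steps:
        # Linear ramp from init_scale * peak_lr up to peak_lr.
        frac = step / warmup_steps
        return peak_lr * (init_scale + (1.0 - init_scale) * frac)
    if step < warmup_steps + hold_steps:
        # Hold at the peak learning rate.
        return peak_lr
    # Exponential decay from peak_lr down to final_scale * peak_lr.
    frac = min(1.0, (step - warmup_steps - hold_steps) / decay_steps)
    return peak_lr * (final_scale ** frac)


total = 150_000
schedule = [tri_stage_lr(s, total) for s in (0, 15_000, 90_000, 150_000)]
```

Under these assumptions, the 150K pretraining steps split into 15K warm\-up steps, 75K steps at the peak rate, and 60K decay steps\.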

#### 4\.2 TTS Finetuning

We conduct downstream TTS experiments on 11 languages spanning both high\- and low\-resource settings, with all datasets resampled to 22,050 Hz\. The high\-resource group includes English \[ljspeech\], German \[thorsten\], and Mandarin Chinese \[aishell3\], each with 20 hours of training data\. The low\-resource group consists of eight Asian and African languages: Javanese, Sundanese, Khmer, Nepali \[lowresourcedb1\], Sinhala \[lowresourcedb2\], Afrikaans, Setswana, and Xhosa \[lowresourcedb3\]\. Specifically, we used 5 hours of training data for Javanese and Sundanese, 3 hours for Khmer, 2 hours for Afrikaans, Nepali, Setswana, and Xhosa, and 1 hour for Sinhala, reflecting differences in available data per language\.

For TTS modeling, we adopt VITS \[vits\] as the backbone architecture and compare its original text encoder with existing BERT\-style encoders, including m\-PLBERT and XPhoneBERT, as well as the proposed UR\-BERT\. Low\-resource models are trained for 100K steps, while high\-resource models are trained for 300K steps, both with a batch size of 32, following the training protocols of MMS\-TTS and XPhoneBERT, respectively\. The optimization settings largely follow the pretraining configuration, except that the text encoder is frozen for the first 25% of training steps and the warm\-up schedule is omitted to stabilize the monotonic alignment search module\.

Table 1: Comparison of baselines and UR\-BERT\. Dataset amounts are measured in sentences and hours, respectively\.

⋆ While the Romanization toolkit \(Uroman\) is theoretically language\-agnostic, we report the maximum number of languages empirically validated in prior studies\.

Table 2: Performance on high\-resource languages\. MPB, XPB, and URB denote m\-PLBERT, XPhoneBERT, and UR\-BERT, respectively\. ΔCER is reported in percentage points, and F0 denotes log\-F0 RMSE\. Best results are bolded, and second\-best are underlined\.

Table 3: Performance on low\-resource languages\. We denote the dataset size for each language, and best results are bolded\.

#### 4\.3 Performance Metrics

We evaluate UR\-BERT using one subjective and four objective metrics to provide a comprehensive assessment\. For subjective evaluation, we conduct a standard mean opinion score \(MOS\) test on a 1\-5 scale with phonetic guidance\. Ratings are collected on 520 samples from 44 participants with diverse regional backgrounds\. For objective quality assessment, we employ the UTokyo MOS Prediction System \(UTMOS\) \[voicemos,utmos\]\. To mitigate potential cross\-lingual bias, we report relative degradation \(ΔUTM\) with respect to ground\-truth \(GT\) samples\. Intelligibility is evaluated using character error rate \(CER\), computed from transcriptions generated by Omnilingual\-ASR\-CTC\-1B \(https://huggingface\.co/facebook/omniASR\-CTC\-1B\), and is similarly reported as relative degradation \(ΔCER\) against GT speech\. To quantify spectral and prosodic differences between synthesized and GT speech, we additionally report mel\-cepstral distance \(MCD\) and log\-F0 root mean squared error \(log\-F0 RMSE\)\.
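
As an illustration of the intelligibility metric, the sketch below computes CER from a Levenshtein edit distance and then a ΔCER value\. Reading ΔCER as the CER of synthesized speech minus the CER of GT speech, in percentage points, is an assumption based on the description above; the function names are hypothetical, and the ASR transcripts are taken as given strings\.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


def delta_cer(reference, asr_on_synth, asr_on_gt):
    """CER on synthesized speech minus CER on GT speech (percentage points).

    This sign convention is an assumption, not confirmed by the paper.
    """
    return cer(reference, asr_on_synth) - cer(reference, asr_on_gt)


# Toy transcripts: the ASR output for GT audio is perfect, while the output
# for synthesized audio has one substituted character out of eleven.
example = delta_cer("hello world", "hallo world", "hello world")
```

Subtracting the GT score removes errors that the ASR model makes even on real recordings, so the remaining gap is attributable to the synthesized audio\.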

### 5 Results

#### 5\.1 Performance on High\-Resource Languages

Table [2](https://arxiv.org/html/2606.11681#S4.T2) presents TTS evaluation results on high\-resource languages\. While VITS achieves strong baseline performance in these settings, incorporating UR\-BERT consistently improves both subjective and objective metrics across all evaluated languages\. In contrast, m\-PLBERT exhibits notable performance degradation when integrated with VITS, often producing fluent but phonetically inaccurate speech, resulting in elevated CER\. XPhoneBERT achieves competitive performance as a result of large\-scale multilingual pretraining; however, it demonstrates lower naturalness and intelligibility compared to UR\-BERT\.

The performance gains of UR\-BERT are particularly notable given that it is trained on only about 2\.5% of the data used by XPhoneBERT \(8M vs\. 330M sentences\)\. We attribute this efficiency to the proposed framework, which combines a compact token space enabled by Romanization with an STP objective\. The reduced vocabulary facilitates data\-efficient learning, while speech\-aware pretraining enhances text–speech alignment and enriches phonetic representations\.

Table 4: Ablation study on STP\. We report the MOS of UR\-BERT with and without STP\. Bold indicates the higher value\.

#### 5\.2 Performance on Low\-Resource Languages

Table [3](https://arxiv.org/html/2606.11681#S4.SS2) presents TTS performance on low\-resource languages\. Group 1 includes languages supported by both XPhoneBERT and UR\-BERT, whereas Group 2 consists of languages supported only by UR\-BERT\. Phoneme\-based baselines often fail to support many low\-resource languages due to the absence of reliable G2P systems\. In contrast, UR\-BERT naturally extends to these languages through Romanization, demonstrating strong scalability\. Across both groups, UR\-BERT consistently achieves the strongest overall performance, obtaining the highest MOS for every language and outperforming baselines on most objective metrics\. The gains are particularly pronounced in terms of naturalness and intelligibility\.

To further evaluate cross\-lingual generalization of UR\-BERT, we conduct additional experiments on Sundanese \(Group 3\), which is excluded from the pretraining corpus\. Despite this zero\-shot setting, UR\-BERT consistently outperforms baseline models, demonstrating robust generalization to languages unseen during pretraining\.

#### 5\.3 Effectiveness of Speech Token Prediction Objective

Table [4](https://arxiv.org/html/2606.11681#S5.T4) shows that removing the STP objective degrades performance across nearly all languages, with noticeable MOS drops in both high\- and low\-resource settings\. The effect is particularly pronounced for high\-resource languages, while remaining evident in low\-resource and zero\-shot scenarios\. These results indicate that STP is critical for enriching phonetic representations and stabilizing text–speech alignment, effectively compensating for the phonetic abstraction inherent in Romanization\.

### 6 Conclusion

In this paper, we propose UR\-BERT, a multilingual and multimodal pretrained text encoder for text\-to\-speech applications\. By adopting Romanization as a unified text representation, UR\-BERT overcomes the language coverage limitations of conventional G2P pipelines and enables scalable pretraining across 495 languages\. To further enhance phonetic fidelity, we introduce a speech\-token prediction objective that injects acoustic knowledge into text representations and strengthens text–speech modality alignment\. Experimental results demonstrate that UR\-BERT consistently outperforms prior approaches across languages with varying resource availability, while exhibiting strong cross\-lingual generalization\. We envision UR\-BERT as a foundational building block toward truly universal, massively multilingual TTS systems, enabling scalable and inclusive speech synthesis across the world’s languages\.

### 7 Generative AI Use Disclosure

All co\-authors attest that generative AI tools were employed exclusively to refine human\-authored text and to support LaTeX formatting of the manuscript, including tables and figures\. No generative AI tools were used in the development of research ideas, analytical procedures, or the creation of any substantive scientific content\. We further reaffirm our commitment to the responsible and ethical use of generative AI in accordance with established research ethics\.

### 8 Acknowledgement

This work was supported by the National Research Foundation of Korea \(NRF\) grant funded by the Korea government Ministry of Science and ICT \(MSIT\) \(RS\-2026\-25468664\)\.

### References

## Appendix

![[Uncaptioned image]](https://arxiv.org/html/2606.11681v1/figures/mos_instruction.png)

\(a\) Instructions provided to the participants\.

![[Uncaptioned image]](https://arxiv.org/html/2606.11681v1/figures/mos_protocol.png)

\(b\) Interface of the MOS evaluation platform\.

Figure 3: MOS evaluation setup: \(a\) detailed instructions for quality assessment, and \(b\) a snapshot of the evaluation interface\.

### Appendix A Details on the Pretraining Dataset

Preprocessing\. We removed samples whose transcriptions contained digits or parenthetical expressions\. Digits may correspond to different pronunciations across languages, leading to token–pronunciation mismatches, while parenthetical content is inconsistently realized in speech\. The Omnilingual ASR corpus generally contains longer utterances, as its collection protocol was designed to elicit natural, prompt\-based responses\. To reduce computational overhead and excessive padding during training, we segmented the Omnilingual ASR samples into chunks of up to 30 seconds using MMS\-FA, aligning their duration distribution with that of FLEURS and Common Voice\.

Configuration\. As described in Section [4\.1](https://arxiv.org/html/2606.11681#S4.SS1), we combined three ASR\-oriented speech–text paired datasets to construct a large\-scale pretraining corpus for UR\-BERT\. Language\-specific details, including language names, ISO\-639\-3 codes, and dataset sizes, are provided in Table [6](https://arxiv.org/html/2606.11681#A2.T6), listed in alphabetical order\.

### Appendix B Details on MOS Evaluation Protocol

Participants\. We recruited 44 participants through community outreach\. Given the multilingual scope of our evaluation, which spans 11 languages across diverse geographic regions \(e\.g\., the United States, the United Kingdom, Germany, China, as well as regions in Africa and Asia\), we selected participants with demonstrated familiarity with at least a subset of the evaluated languages\. All participants were fluent in English, and several had lived in or had prior exposure to other language regions \(e\.g\., Germany or Indonesia\), ensuring reliable judgments\.

Evaluation Procedure\. To facilitate faithful evaluation, we provided both grapheme transcriptions and Romanized transcriptions for each utterance as phonetic guidance, allowing participants to assess whether the synthesized speech aligned with the intended text\. Language information was also displayed for each sample\. Participants were instructed to evaluate in a quiet environment using high\-quality personal audio equipment \(e\.g\., headphones\)\. The use of loudspeakers was prohibited to avoid variability in acoustic perception\. The detailed survey interface is illustrated in Figure [3](https://arxiv.org/html/2606.11681#A2)\.

Table 5: Details of the TTS models; 41 models in total\.

Stimuli and Assignment\. We randomly sampled 10 utterances per configuration\. Sample identities were shared across models within each language to enable fair comparison \(e\.g\., VITS, m\-PLBERT, XPhoneBERT, and UR\-BERT variants for English used identical text prompts\)\. As shown in Table [5](https://arxiv.org/html/2606.11681#A2.T5), this resulted in 520 evaluated samples: 110 ground\-truth samples and 410 generated samples\. To mitigate listener fatigue and maintain consistent scoring criteria, the samples were divided into four non\-overlapping groups of 130 utterances each \(Groups A, B, C, and D\)\. Participants were evenly assigned to groups \(11 per group\), ensuring that no evaluator assessed overlapping samples\.
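
The sample counts above follow from simple arithmetic, which the sketch below reproduces\. The counts are taken from the text; the round\-robin partition in `split_into_groups` is only one hypothetical way to form four non\-overlapping groups, since the paper does not describe how samples were allotted to groups\.

```python
NUM_LANGUAGES = 11           # languages in the MOS evaluation
NUM_MODELS = 41              # TTS models listed in Table 5
UTTERANCES_PER_CONFIG = 10   # utterances sampled per configuration
NUM_GROUPS = 4               # Groups A, B, C, and D
NUM_PARTICIPANTS = 44

ground_truth = NUM_LANGUAGES * UTTERANCES_PER_CONFIG
generated = NUM_MODELS * UTTERANCES_PER_CONFIG
total_samples = ground_truth + generated
raters_per_group = NUM_PARTICIPANTS // NUM_GROUPS


def split_into_groups(sample_ids, num_groups):
    """Partition ids into non-overlapping groups (round-robin, an assumption)."""
    return [sample_ids[g::num_groups] for g in range(num_groups)]


groups = split_into_groups(list(range(total_samples)), NUM_GROUPS)
```

Each group then holds 130 samples and is rated by 11 participants, consistent with the figures reported above\.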

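| Language | ISO Code | Duration \(hours\) | Sentences |
| --- | --- | --- | --- |
| Abadi | kbt | 5\.91 | 1828 |
| Abkhazian | abk | 31\.86 | 21037 |
| Abron | abr | 5\.10 | 1908 |
| Abua | abn | 6\.34 | 2098 |
| Afade | aal | 6\.88 | 2073 |
| Afrikaans | afr | 2\.41 | 939 |
| Agwagwune | yay | 4\.62 | 1753 |
| Akan | aka | 0\.22 | 205 |
| Akebu | keu | 6\.72 | 2389 |
| Alago | ala | 6\.83 | 2138 |
| Algerian Arabic | arq | 9\.66 | 2081 |
| Ambonese Malay | abs | 6\.49 | 2216 |
| Amharic | amh | 9\.13 | 2955 |
| Anaang | anw | 6\.87 | 2696 |
| Angika | anp | 5\.07 | 1404 |
| Antankarana Malagasy | xmv | 12\.75 | 2589 |
| Arabic, Algerian Saharan | aao | 1\.96 | 449 |
| Arabic, Dhofari | adf | 0\.31 | 66 |
| Arabic, Judeo\-Moroccan | aju | 7\.14 | 1235 |
| Arbëreshë Albanian | aae | 9\.75 | 3788 |
| Armenian | hye | 21\.28 | 11668 |
| Ashe | ahs | 7\.03 | 2099 |
| Askopan | eiv | 5\.64 | 1548 |
| Assamese | asm | 9\.04 | 3029 |
| Asturian | ast | 6\.16 | 2376 |
| Awak | awo | 6\.06 | 2010 |
| Ayacucho Quechua | quy | 0\.04 | 26 |
| Azerbaijani | aze | 7\.11 | 2166 |
| Bacama | bcy | 5\.57 | 1801 |
| Bade | bde | 6\.07 | 2127 |
| Bago\-Kusuntu | bqg | 6\.55 | 2276 |
| Baharna Arabic | abv | 10\.26 | 1734 |
| Balanta\-Ganja | bjt | 6\.37 | 2276 |
| Bangwinji | bsj | 6\.40 | 2287 |
| Banjar | bjn | 6\.72 | 2561 |
| Bara Malagasy | bhr | 11\.89 | 2698 |
| Barok | bjk | 6\.21 | 1597 |
| Basa \(Cameroon\) | bas | 2\.16 | 2109 |
| Basa \(Nigeria\) | bzw | 6\.56 | 2443 |
| Bashkir | bak | 142\.96 | 119000 |
| Basque | eus | 205\.87 | 130034 |
| Batak Mandailing | btm | 6\.67 | 2329 |
| Bayot | bda | 6\.01 | 1461 |
| Belarusian | bel | 483\.05 | 349582 |
| Bengali | ben | 41\.95 | 23777 |
| Betawi | bew | 6\.76 | 2482 |
| Bhili | bhb | 6\.75 | 1922 |
| Bhojpuri | bho | 5\.38 | 1335 |
| Bilur | bxf | 6\.36 | 1864 |
| Bima | bhp | 6\.47 | 1805 |
| Bodo \(India\) | brx | 6\.52 | 1957 |
| Boghom | bux | 5\.32 | 1870 |
| Bokyi | bky | 5\.34 | 1761 |
| Bomu | bmq | 10\.24 | 2649 |
| Bondei | bou | 6\.19 | 1948 |
| Borgu Fulfulde | fue | 9\.35 | 2218 |
| Bosnian | bos | 7\.56 | 2427 |
| Brahui | brh | 5\.59 | 1879 |
| Braj | bra | 6\.32 | 1975 |
| Breton | bre | 3\.43 | 3509 |
| Buduma | bdm | 6\.34 | 2018 |
| Buginese | bug | 7\.23 | 1739 |
| Bukharic | bhh | 7\.44 | 1587 |
| Bulgarian | bul | 14\.18 | 7244 |
| Bundeli | bns | 5\.55 | 1655 |
| Bura\-Pabir | bwr | 7\.07 | 2449 |
| Burak | bys | 6\.46 | 2038 |
| Burmese | mya | 9\.09 | 2353 |
| Cacaloxtepec Mixtec | miu | 5\.99 | 1617 |
| Cakfem\-Mushere | cky | 5\.51 | 1931 |
| Campidanese Sardinian | sro | 6\.36 | 2250 |
| Catalan | cat | 1809\.27 | 1203445 |
| Cebuano | ceb | 9\.25 | 2563 |
| Cen | cen | 6\.15 | 2268 |
| Central Kurdish | ckb | 16\.95 | 10243 |
| Central Nahuatl | nhn | 5\.02 | 1087 |
| Central Pame | pbs | 6\.38 | 1988 |
| Central Pashto | pst | 19\.77 | 7735 |
| Central\-Eastern Niger Fulfulde | fuq | 3\.75 | 1029 |
| Chadian Arabic | shu | 2\.29 | 319 |
| Chichicapan Zapotec | zpv | 5\.62 | 2445 |
| Chiga | cgg | 7\.47 | 2387 |
| Chimalapa Zoque | zoh | 5\.89 | 1463 |
| Chimborazo Highland Quichua | qug | 5\.95 | 1731 |
| Chitwania Tharu | the | 5\.42 | 1882 |
| Chuvash | chv | 1\.99 | 1455 |
| Cibak | ckl | 6\.24 | 2096 |
| Coastal Konjo | kjc | 6\.19 | 1863 |
| Croatian | hrv | 8\.78 | 2684 |
| Cross River Mbembe | mfn | 5\.90 | 2053 |
| Cuyamecalco Mixtec | xtu | 6\.13 | 1621 |
| Czech | ces | 35\.18 | 23927 |
| Dadiya | dbd | 5\.83 | 2021 |
| Danish | dan | 9\.91 | 5515 |
| Dazaga | dzg | 5\.71 | 2179 |
| Deccan | dcc | 7\.07 | 2668 |
| Degema | deg | 7\.03 | 2162 |
| Dera \(Nigeria\) | kna | 7\.71 | 2869 |
| Dghwede | dgh | 5\.65 | 1930 |
| Dhivehi | div | 3\.81 | 2654 |
| Dijim\-Bwilim | cfa | 6\.78 | 2135 |
| Dotyali | dty | 7\.41 | 2688 |
| Dutch | nld | 59\.96 | 45746 |
| Dyula | dyu | 0\.15 | 88 |
| D˜uya | ldb | 7\.74 | 2199 |
| Eastern Bolivian Guaraní | gui | 22\.35 | 3867 |
| Eastern Egyptian Bedawi Arabic | avl | 1\.86 | 311 |
| Eastern Krahn | kqo | 5\.42 | 2409 |
| Eastern Mari | mhr | 234\.73 | 185245 |
| Eastern Yiddish | ydd | 12\.95 | 2713 |
| Eggon | ego | 6\.26 | 1962 |
| Egyptian Arabic | arz | 15\.34 | 3089 |
| Ejagham | etu | 6\.35 | 2354 |
| Eleme | elm | 7\.13 | 2060 |
| Eloyi | afo | 7\.19 | 2167 |
| Embu | ebu | 5\.46 | 1694 |
| English | eng | 1784\.77 | 1129098 |
| Erzya | myv | 1\.97 | 1241 |
| Esan | ish | 5\.89 | 1334 |
| Esperanto | epo | 246\.13 | 142968 |
| Extremaduran | ext | 13\.07 | 4382 |
| Fanti | fat | 11\.69 | 2951 |
| Farefare | gur | 7\.14 | 2204 |
| Filipino | fil | 5\.89 | 1488 |
| Filomena Mata\-Coahuitlán Totonac | tlp | 7\.33 | 1995 |
| Finnish | fin | 8\.92 | 3961 |
| Fipa | fip | 7\.00 | 2053 |
| French | fra | 858\.73 | 595563 |
| Fulah | ful | 9\.83 | 2486 |
| Fulfulde, Bagirmi | fui | 14\.55 | 3622 |
| Galician | glg | 100\.05 | 70715 |
| Gambian Wolof | wof | 6\.50 | 2416 |
| Ganda | lug | 119\.34 | 68780 |
| Garhwali | gbm | 14\.80 | 3086 |
| Gbagyi | gbr | 8\.16 | 2723 |
| Gbari | gby | 8\.47 | 2619 |
| Geji | gyz | 6\.76 | 2467 |
| Georgian | kat | 96\.35 | 63703 |
| German | deu | 969\.37 | 610065 |
| Geser\-Gorom | ges | 6\.15 | 1463 |
| Gheg Albanian | aln | 3\.60 | 591 |
| Glavda | glw | 6\.90 | 2298 |
| Goan Konkani | gom | 5\.16 | 1103 |
| Goemai | ank | 6\.49 | 2396 |
| Gola | gol | 5\.29 | 1905 |
| Guarani | grn | 1\.15 | 1049 |
| Guduf\-Gava | gdf | 8\.03 | 2302 |
| Guerrero Amuzgo | amu | 6\.74 | 2058 |
| Gujarati | guj | 6\.61 | 2392 |
| Gulf Arabic | afb | 18\.76 | 2866 |
| Gusii | guz | 5\.64 | 2320 |
| Gusilay | gsl | 5\.90 | 1617 |
| Gweno | gwe | 6\.24 | 2172 |
| Güilá Zapotec | ztu | 5\.64 | 1999 |
| Hahon | hah | 6\.14 | 1825 |
| Haitian | hat | 0\.02 | 11 |
| Hakha Chin | cnh | 0\.65 | 817 |
| Hakö | hao | 6\.63 | 2039 |
| Halia | hla | 6\.21 | 1919 |
| Haroti | hoj | 8\.46 | 2060 |
| Hausa | hau | 11\.86 | 4307 |
| Hawaiian | haw | 6\.16 | 1027 |
| Hebrew | heb | 8\.21 | 3518 |
| Herero | her | 6\.02 | 2151 |
| Highland Konjo | kjk | 6\.26 | 2079 |
| Hijazi Arabic | acw | 21\.75 | 3486 |
| Hindi | hin | 10\.93 | 6539 |
| Huaxcaleca Nahuatl | nhq | 5\.94 | 1980 |
| Huba | hbb | 6\.14 | 2290 |
| Huitepec Mixtec | mxs | 5\.87 | 1570 |
| Hula | hul | 6\.10 | 2009 |
| Hungarian | hun | 63\.64 | 41688 |
| Hunjara\-Kaina Ke | hkk | 4\.19 | 955 |
| Hwana | hwo | 7\.26 | 2405 |
| Icelandic | isl | 2\.13 | 732 |
| Idakho\-Isukha\-Tiriki | ida | 6\.28 | 2280 |
| Idoma | idu | 6\.90 | 2665 |
| Igbo | ibo | 9\.49 | 2190 |
| Igo | ahl | 6\.73 | 2596 |
| Ikposo | kpo | 5\.67 | 1524 |
| Ikwere | ikw | 6\.03 | 1519 |
| Indonesian | ind | 14\.54 | 6955 |
| Interlingua | ina | 4\.73 | 4447 |
| Irish | gle | 9\.78 | 2812 |
| Isekiri | its | 7\.89 | 2394 |
| Isoko | iso | 6\.15 | 1789 |
| Italian | ita | 261\.38 | 175120 |
| Ito | itw | 6\.54 | 2301 |
| Itzá | itz | 4\.73 | 1003 |
| Ixtayutla Mixtec | vmj | 6\.34 | 1767 |
| Izon | ijc | 5\.84 | 1706 |
| Jambi Malay | jax | 6\.81 | 2335 |
| Japanese | jpn | 24\.65 | 16747 |
| Jaunsari | jns | 4\.42 | 909 |
| Javanese | jav | 8\.58 | 2409 |
| Jiba | juo | 6\.88 | 2637 |
| Jju | kaj | 5\.90 | 2187 |
| Juxtlahuaca Mixtec | vmc | 5\.93 | 1755 |
| Kabras | lkb | 6\.28 | 2027 |
| Kabuverdianu | kea | 8\.01 | 2135 |
| Kabyle | kab | 144\.06 | 152224 |
| Kachi Koli | gjk | 4\.46 | 1212 |
| Kairak | ckr | 6\.47 | 2338 |
| Kalabari | ijn | 6\.86 | 2351 |
| Kalenjin | kln | 13\.37 | 11057 |
| Kamba \(Kenya\) | kam | 10\.94 | 2618 |
| Kamo | kcq | 6\.85 | 2783 |
| Kanauji | bjj | 6\.46 | 2290 |
| Kanembu | kbl | 6\.69 | 2265 |
| Kannada | kan | 6\.00 | 1735 |
| Karekare | kai | 7\.09 | 2320 |
| Kashmiri | kas | 4\.94 | 989 |
| Kathoriya Tharu | tkt | 6\.84 | 2241 |
| Kazakh | kaz | 9\.68 | 3093 |
| Keiyo | eyo | 6\.28 | 2000 |
| Khana | ogo | 6\.66 | 2102 |
| Khmer | khm | 5\.23 | 1282 |
| Kinga | zga | 9\.67 | 3620 |
| Kinnauri | kfk | 4\.31 | 1304 |
| Kinyarwanda | kin | 1309\.82 | 929156 |
| Kirghiz | kir | 9\.31 | 3967 |
| Kirya\-Konz@l | fkk | 5\.86 | 2171 |
| Kochila Tharu | thq | 6\.82 | 2392 |
| Kohumono | bcs | 7\.10 | 2431 |
| Kok Borok | trp | 7\.88 | 2482 |
| Kol \(Papua New Guinea\) | kol | 6\.14 | 2186 |
| Koma | kmy | 6\.49 | 2431 |
| Konkani \(individual language\) | knn | 8\.10 | 1802 |
| Konzo | koo | 9\.49 | 3118 |
| Korean | kor | 6\.47 | 2154 |
| Korwa | kfp | 7\.08 | 1535 |
| Kota | kfe | 9\.50 | 3394 |
| Kuanua | ksd | 6\.02 | 2064 |
| Kuanyama | kua | 6\.31 | 2269 |
| Kui \(India\) | uki | 6\.66 | 1680 |
| Kulung \(Nigeria\) | bbu | 6\.24 | 2285 |
| Kuot | kto | 6\.02 | 1740 |
| Kushi | kuh | 6\.75 | 2100 |
| Kwambi | kwm | 6\.64 | 2519 |
| Lala\-Roba | lla | 6\.31 | 2226 |
| Lamang | hia | 7\.12 | 2471 |
| Lao | lao | 5\.68 | 1516 |
| Larike\-Wakasihu | alo | 6\.12 | 1812 |
| Latgalian | ltg | 5\.50 | 4534 |
| Latvian | lav | 27\.82 | 15805 |
| Levantine Arabic | apc | 5\.75 | 1886 |
| Liana\-Seti | ste | 6\.21 | 1603 |
| Liberia Kpelle | xpe | 6\.21 | 2482 |
| Liberian English | lir | 7\.58 | 2239 |
| Libyan Arabic | ayl | 16\.25 | 2484 |
| Ligurian | lij | 9\.00 | 3460 |
| Lijili | mgi | 6\.79 | 2511 |
| Lingala | lin | 11\.32 | 2356 |
| Lithuanian | lit | 19\.02 | 10604 |
| Logooli | rag | 5\.93 | 2053 |
| Logudorese Sardinian | src | 6\.53 | 2006 |
| Loloda | loa | 5\.83 | 1272 |
| Longuda | lnu | 6\.59 | 2054 |
| Loxicha Zapotec | ztp | 6\.29 | 2128 |
| Luo \(Kenya and Tanzania\) | luo | 12\.18 | 5853 |
| Lushai | lus | 10\.02 | 3435 |
| Luxembourgish | ltz | 6\.24 | 1922 |
| Maasina Fulfulde | ffm | 10\.41 | 1664 |
| Maba | mde | 8\.80 | 2613 |
| Macedonian | mkd | 7\.58 | 3882 |
| Mafa | maf | 6\.35 | 2312 |
| Malagasy, Southern Betsimisaraka | bzc | 16\.99 | 3293 |
| Malayalam | mal | 8\.73 | 3541 |
| Mali | gcc | 5\.61 | 2367 |
| Malinaltepec Me'phaa | tcf | 6\.21 | 2586 |
| Maltese | mlt | 9\.87 | 4172 |
| Mandara | tbf | 5\.87 | 1562 |
| Mandarin Chinese | cmn | 56\.27 | 39265 |
| Mandjak | mfv | 6\.19 | 1920 |
| Manggarai | mqy | 6\.39 | 1702 |
| Mansoanka | msw | 6\.10 | 2317 |
| Maori | mri | 11\.67 | 2317 |
| Marathi | mar | 11\.39 | 4158 |
| Marghi Central | mrt | 6\.30 | 2290 |
| Marghi South | mfm | 6\.57 | 2030 |
| Maria \(India\) | mrr | 6\.98 | 2403 |
| Masikoro Malagasy | msh | 13\.89 | 3207 |
| Mazaltepec Zapotec | zpy | 6\.10 | 1380 |
| Mazatlán Mazatec | vmz | 5\.15 | 1600 |
| Mazatlán Mixe | mzl | 6\.23 | 1603 |
| Mbe | mfo | 6\.50 | 2330 |
| Mekeo | mek | 5\.29 | 1527 |
| Meru | mer | 6\.07 | 1878 |
| Mesopotamian Arabic | acm | 3\.69 | 775 |
| Mewari | mtr | 1\.88 | 374 |
| Mitlatongo Mixtec | vmm | 6\.45 | 1482 |
| Miya | mkf | 6\.42 | 1930 |
| Modern Greek | ell | 9\.68 | 4417 |
| Moksha | mdf | 0\.26 | 175 |
| Mom Jango | ver | 7\.08 | 2372 |
| Mongolian | mon | 11\.04 | 4467 |
| Moroccan Arabic | ary | 7\.89 | 1658 |
| Motu | meu | 6\.65 | 2247 |
| Musi | mui | 6\.58 | 2214 |
| Naba | mne | 4\.92 | 1529 |
| Najdi Arabic | ars | 19\.10 | 3094 |
| Nalik | nal | 6\.07 | 1403 |
| Ndonga | ndo | 5\.92 | 2300 |
| Neapolitan | nap | 6\.30 | 2632 |
| Nepali \(macrolanguage\) | nep | 8\.74 | 2919 |
| Ngamo | nbh | 5\.72 | 1566 |
| Ngas | anc | 6\.69 | 2440 |
| Ngizim | ngi | 6\.44 | 2342 |
| Nigerian Fulfulde | fuv | 5\.60 | 1727 |
| Nimadi | noe | 5\.85 | 1971 |
| Nobiin | fia | 5\.85 | 1595 |
| North Mesopotamian Arabic | ayp | 6\.53 | 989 |
| North Moluccan Malay | max | 6\.18 | 1575 |
| Northern Betsimisaraka Malagasy | bmm | 18\.86 | 4052 |
| Northern Hindko | hno | 4\.51 | 1727 |
| Northern Kurdish | kmr | 5\.84 | 5277 |
| Northern Pame | pmq | 6\.15 | 1417 |
| Northern Pashto | pbu | 6\.72 | 1023 |
| Northern Uzbek | uzn | 11\.84 | 2723 |
| Norwegian Bokmål | nob | 8\.15 | 2600 |
| Norwegian Nynorsk | nno | 0\.50 | 464 |
| Notsi | ncf | 4\.80 | 1146 |
| Nyanja | nya | 7\.92 | 2093 |
| Nyankpa | yes | 6\.38 | 1838 |
| Nzanyi | nja | 5\.96 | 1806 |
| Occitan | oci | 10\.83 | 2909 |
| Od | odk | 4\.48 | 1261 |
| Odia | ory | 5\.05 | 2665 |
| Odual | odu | 6\.17 | 1983 |
| Omani Arabic | acx | 21\.76 | 3408 |
| Orma | orc | 11\.12 | 2121 |
| Oromo | orm | 5\.03 | 1350 |
| Ossetian | oss | 0\.66 | 414 |
| Pahari\-Potwari | phr | 5\.82 | 2360 |
| Panjabi | pan | 6\.13 | 2323 |
| Papuan Malay | pmy | 6\.52 | 2559 |
| Pedi | nso | 6\.73 | 1232 |
| Pero | pip | 5\.45 | 1843 |
| Persian | fas | 40\.79 | 32225 |
| Petats | pex | 6\.42 | 2262 |
| Piemontese | pms | 15\.95 | 2790 |
| Piya\-Kwonci | piy | 7\.03 | 1888 |
| Plateau Malagasy | plt | 19\.05 | 4454 |
| Polish | pol | 41\.69 | 25813 |
| Poqomam | poc | 6\.15 | 1699 |
| Portuguese | por | 33\.59 | 24643 |
| Pulaar | fuc | 14\.29 | 2689 |
| Pular | fuf | 13\.69 | 3074 |
| Pökoot | pko | 6\.26 | 2278 |
| Qaqet | byx | 5\.84 | 2043 |
| Quiotepec Chinantec | chq | 6\.38 | 2127 |
| Rana Tharu | thr | 5\.17 | 1397 |
| Rangi | lag | 6\.23 | 2247 |
| Rapoisi | kyx | 6\.16 | 2477 |
| Ratahan | rth | 5\.43 | 1142 |
| Rayón Zoque | zor | 6\.59 | 2064 |
| Romanian | ron | 13\.39 | 7429 |
| Romansh \(Sursilvan\) | roh | 2\.49 | 1591 |
| Romansh \(Vallader\) | roh | 1\.00 | 557 |
| Rombo | rof | 6\.54 | 2286 |
| Rotokas | roo | 6\.03 | 1791 |
| Russian | rus | 44\.40 | 28714 |
| Sacapulteco | quv | 6\.36 | 1918 |
| Saidi Arabic | aec | 7\.07 | 1922 |
| Sakalava Malagasy | skg | 8\.59 | 1841 |
| Saleman | sau | 5\.76 | 1547 |
| Samba Daka | ccg | 6\.20 | 1877 |
| Samba Leko | ndi | 6\.78 | 2473 |
| San Felipe Otlaltepec Popoloca | pow | 6\.24 | 1490 |
| San Francisco Del Mar Huave | hue | 5\.94 | 1542 |
| San Juan Atzingo Popoloca | poe | 5\.86 | 1627 |
| San Martín Itunyoso Triqui | trq | 5\.55 | 2022 |
| San Miguel El Grande Mixtec | mig | 6\.36 | 1744 |
| Santa Catarina Albarradas Zapotec | ztn | 5\.96 | 1748 |
| Santali \(Ol Chiki\) | sat | 0\.44 | 333 |
| Saposa | sps | 5\.77 | 1822 |
| Saraiki | skr | 1\.50 | 1556 |
| Sardinian | srd | 1\.16 | 923 |
| Saya | say | 5\.36 | 1552 |
| Serbian | srp | 9\.51 | 4162 |
| Shona | sna | 7\.48 | 1924 |
| Siar\-Lak | sjr | 5\.93 | 1538 |
| Sibe | nco | 6\.31 | 1836 |
| Sicilian | scn | 8\.96 | 2421 |
| Sikkimese | sip | 4\.06 | 1085 |
| Sinaugoro | snc | 6\.33 | 1859 |
| Sindhi | snd | 9\.47 | 2903 |
| Sinhala | sin | 8\.07 | 2395 |
| Sinicahua Mixtec | xti | 6\.33 | 1650 |
| Sipacapense | qum | 6\.37 | 1850 |
| Siwai | siw | 6\.66 | 2386 |
| Slovak | slk | 13\.48 | 8917 |
| Slovenian | slv | 7\.24 | 3414 |
| Solos | sol | 6\.41 | 2117 |
| Somali | som | 9\.26 | 2362 |
| Soninke | snk | 6\.72 | 2760 |
| Southeastern Nochixtlán Mixtec | mxy | 5\.74 | 1276 |
| Southern Pashto | pbt | 6\.82 | 1052 |
| Soyaltepec Mazatec | vmp | 6\.09 | 1613 |
| Spanish | spa | 515\.42 | 355892 |
| Standard Arabic | arb | 37\.31 | 30271 |
| Standard Estonian | ekk | 11\.75 | 5323 |
| Standard Malay | zsm | 7\.28 | 2110 |
| Standard Moroccan Tamazight | zgh | 0\.73 | 840 |
| Sudanese Arabic | apd | 6\.71 | 1674 |
| Sulka | sua | 6\.17 | 1769 |
| Swahili \(individual language\) | swh | 79\.69 | 48919 |
| Swedish | swe | 15\.40 | 10011 |
| Tae' | rob | 5\.66 | 1452 |
| Tahaggart Tamahaq | thv | 1\.62 | 350 |
| Taita | dav | 2\.29 | 2096 |
| Tajik | tgk | 6\.66 | 1898 |
| Tamil | tam | 89\.09 | 47680 |
| Tandroy\-Mahafaly Malagasy | tdx | 1\.71 | 460 |
| Tangale | tan | 7\.00 | 2617 |
| Tanosy Malagasy | txy | 6\.05 | 1632 |
| Tarok | yer | 6\.15 | 2067 |
| Tatar | tat | 8\.84 | 8394 |
| Tedaga | tuq | 6\.24 | 1953 |
| Telugu | tel | 5\.82 | 1780 |
| Teop | tio | 6\.01 | 1795 |
| Tepinapa Chinantec | cte | 6\.64 | 2136 |
| Tera | ttr | 6\.39 | 2213 |
| Terei | buo | 6\.52 | 2065 |
| Termanu | twu | 6\.36 | 1516 |
| Tesaka Malagasy | tkg | 12\.96 | 2700 |
| Tetelcingo Nahuatl | nhg | 6\.25 | 1865 |
| Thai | tha | 43\.69 | 34889 |
| Tidaá Mixtec | mtx | 5\.73 | 1327 |
| Tidore | tvo | 6\.85 | 1459 |
| Tigak | tgc | 5\.77 | 1442 |
| Tigre | tig | 2\.94 | 1972 |
| Tigrinya | tir | 0\.03 | 20 |
| Tilquiapan Zapotec | zts | 6\.03 | 1689 |
| Tinputz | tpz | 5\.03 | 1462 |

Tlacoapa 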
Me'phaatpl6\.362811Tlacoatzintepec Chinantecctl6\.081979Toki Ponatok2\.552629Tomoiptqp5\.691612Tondanotdn5\.121201Tonseatxs6\.341584Toorottj6\.681976Torauttu5\.442028Tsimihety Malagasyxmw11\.372664Tsotsolto6\.192251Tswanatsn1\.311078Tugentuy6\.681986Tulatul5\.631925Tulutcy6\.932172Tungaglcm5\.931933Tunisian Arabicaeb21\.142986Turkanatuv5\.692116Turkishtur49\.4641561Turkmentuk1\.03739Tututepec Mixtecmtu6\.181658Ubagharabyc7\.052663Uighuruig166\.71107645Ukrainianukr38\.3028879Umbunduumb5\.751034Upper Sorbianhsb1\.48808Urduurd13\.788982Uzbekuzb61\.3949656Vaivai5\.641991Vietnamesevie9\.354398Voticvot0\.1196Võrovro11\.432266Waci Gbewci6\.222223Wadiyara Kolikxp5\.021476Wajawja6\.722135Wangalwg6\.072236Wapanjuk6\.101983Warjiwji7\.942708Welshcym20\.1910269Wemaleweo6\.051381Western Frisianfry5\.533924Western Juxtlahuaca Mixtecjmx6\.502378Western Maninkakanmlq5\.762038Western Marimrj14\.8914325Western Niger Fulfuldefuh6\.251176Western Panjabipnb6\.521855Wolofwol7\.412013Xanaguía Zapotecztg5\.751620Xhosaxho9\.662648Yaceekr7\.202613Yakutsah3\.762195Yalahatanjal7\.932207Yekheeets6\.491808Yorubayor10\.543388Yue Chineseyue23\.2617377Yutanduchi Mixtecmab6\.681951Zacatlán\-Ahuacatlán\-Tepetzintla Nahuatlnhi0\.0323Zarmadje7\.503244Zazazza0\.79734Zuluzul10\.282126Ömieaom4\.36996Table 6:Detailed configuration of the pretraining dataset\. Duration is measured in hours, and sentences denote the number of speech–text pairs\. Language names follow the Omnilingual ASR configuration, and ISO codes follow ISO\-639\-3\.