Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models

arXiv cs.CL 06/16/26, 04:00 AM Papers
multilingual tokenization large-language-models fairness efficiency empirical-study southeast-asian-languages
Summary
This paper systematically compares equitable tokenizers for multilingual LLMs across 11 Southeast Asian languages, finding that Parity-aware BPE achieves the best efficiency-equity trade-off and that cross-lingual fairness and tokenization efficiency are not fundamentally at odds.
arXiv:2606.15044v1 Announce Type: new Abstract: Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.
# Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models
Source: [https://arxiv.org/html/2606.15044](https://arxiv.org/html/2606.15044)
Kieron Seven Jun Wei Lee¹, Muhammad Reza Qorib², Andrew Ivan Soegeng¹,³, Hwee Tou Ng¹

¹National University of Singapore, ²Carnegie Mellon University, ³SAP

e0968891@u.nus.edu, mrqorib@cmu.edu, andrew.soegeng@u.nus.edu, dcsnght@nus.edu.sg

###### Abstract

Multilingual large language models (LLMs) depend on subword tokenization to bridge discrete text and continuous neural representation. State-of-the-art multilingual LLMs often use Byte-level Byte-Pair Encoding (BPE) tokenizers that structurally favor high-resource languages and Latin scripts. For speakers of underrepresented languages, particularly those across Southeast Asia, this bias inflates inference costs and widens cross-lingual capability gaps. We present the first systematic comparison of equitable tokenizers on a unified benchmark spanning 11 Southeast Asian languages. Beyond tokenizer-level analysis of compression efficiency and cross-lingual equity, we assess downstream task performance through controlled 1.5B-parameter language model training using the same training data. Our results show that Parity-aware BPE lies on the Pareto frontier of the efficiency-equity trade-off, achieving strong compression parity at competitive cost. Morphology-Driven Byte Encoding delivers the best semantic reasoning performance through morphologically richer representations, albeit at a higher computational expense. Byte Latent Transformer underperforms on downstream tasks, possibly because its architectural assumptions misalign with the constraints of limited low-resource training data. Together, our findings demonstrate that cross-lingual fairness and tokenization efficiency are not fundamentally at odds, and offer practical guidance for designing equitable multilingual models.¹

¹ Source code will be publicly released upon paper publication.

## 1 Introduction

Multilingual large language models (LLMs) are central to cross-lingual information access, yet their performance remains deeply uneven across languages and scripts. A key driver of this disparity is tokenization: how raw text is segmented into subword units shapes model capacity, sequence length, and effective context window across languages (Petrov et al., [2023](https://arxiv.org/html/2606.15044#bib.bib3)).

Byte-level Byte-Pair Encoding (BPE) (Sennrich et al., [2016](https://arxiv.org/html/2606.15044#bib.bib27)) is a widely used tokenization strategy in state-of-the-art LLMs, including the GPT (OpenAI, [2025](https://arxiv.org/html/2606.15044#bib.bib19)) and Llama (Touvron et al., [2023](https://arxiv.org/html/2606.15044#bib.bib10)) families, due to its simplicity and compression efficiency. Byte-level BPE encodes characters as UTF-8 bytes (Consortium, [2011](https://arxiv.org/html/2606.15044#bib.bib33)) and iteratively learns byte-pair merges based on global co-occurrence frequency. This procedure introduces a structural bias: a basic (unaccented) Latin letter is encoded as a single byte, whereas a non-Latin character requires two or more bytes. Combined with English-centric pretraining corpora, BPE’s merge operations disproportionately favor Latin scripts and high-resource languages (Arnett et al., [2024](https://arxiv.org/html/2606.15044#bib.bib4)).
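The byte-count asymmetry can be checked directly. The following minimal sketch uses only Python's built-in `str.encode` to report the UTF-8 length of one character from several scripts. The sample characters are illustrative choices made here, not data from the study.

```python
# UTF-8 byte length of single characters from several scripts.
# The characters are illustrative picks, written as escapes to avoid
# any ambiguity about normalization.
samples = {
    "Latin a": "a",              # U+0061
    "Latin e-acute": "\u00e9",   # accented Latin letter
    "Thai ko kai": "\u0e01",
    "Khmer ka": "\u1780",
    "Burmese ka": "\u1000",
    "Chinese zhong": "\u4e2d",
}

def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of `text`."""
    return len(text.encode("utf-8"))

byte_lengths = {name: utf8_len(ch) for name, ch in samples.items()}
for name, n in byte_lengths.items():
    print(f"{name}: {n} byte(s)")
```

Before any merges are learned, a byte-level tokenizer therefore starts a Thai, Khmer, Burmese, or Chinese character at three tokens, against one for an unaccented Latin letter.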

The practical consequences of such bias are significant. Petrov et al. ([2023](https://arxiv.org/html/2606.15044#bib.bib3)) demonstrated that GPT-4’s Byte-level BPE tokenizer produces sequence length disparities of up to 15×, with Chinese requiring 1.9× more tokens than English, Vietnamese 2.5×, and Burmese 11.7×. For speakers of low-resource non-Latin languages such as Khmer and Lao, these disparities translate directly into higher inference costs, degraded long-context reasoning, and diminished downstream task accuracy (Tamang and Bora, [2024](https://arxiv.org/html/2606.15044#bib.bib31)).

Several tokenizers have been proposed to address these inequities. Parity-aware Byte-Pair Encoding rebalances merge frequencies across scripts (Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)). Morphology-Driven Byte Encoding (MYTE) grounds segmentation in morphological structure (Limisiewicz et al., [2024](https://arxiv.org/html/2606.15044#bib.bib13)). Byte Latent Transformer (BLT) sidesteps a fixed vocabulary by operating directly over dynamic byte patches (Pagnoni et al., [2025](https://arxiv.org/html/2606.15044#bib.bib20)). Each work evaluates its approach against BPE baselines, reporting improvements in equity and multilingual capability. However, these methods have never been compared against each other under uniform experimental conditions.

In this paper, we present a benchmarking study to address this gap with the first systematic analysis of equitable tokenizers\. We compare them across eleven Southeast Asian \(SEA\) languages: English, Burmese, Chinese, Indonesian, Khmer, Lao, Malay, Tagalog, Tamil, Thai, and Vietnamese\. Using Byte\-level BPE as a baseline, and controlling for training data, vocabulary size, and computational budget, we evaluate intrinsic tokenizer metrics and examine downstream LLM performance by training 1\.5B\-parameter decoder\-only language models from scratch\. Our study provides a direct empirical comparison of equitable tokenization methods, offering actionable insights for NLP practitioners to build fairer multilingual LLMs\.

## 2 Related Work

### 2.1 Subword Tokenization

Subword tokenization has become the standard preprocessing step in multilingual LLMs because it segments text in any language into tokens with a single uniform procedure. However, when trained on heterogeneous multilingual corpora, these approaches allocate vocabulary capacity toward high-resource languages and languages written in Latin scripts, embedding structural bias and inequity into the vocabulary.

The downstream consequences are well-documented. Bostrom and Durrett ([2020](https://arxiv.org/html/2606.15044#bib.bib6)) showed that BPE tokens frequently diverge from linguistically motivated morpheme boundaries. More recently, Selvamurugan et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib2)) quantified cross-lingual tokenization inequity through normalized sequence length and subword fertility, demonstrating that the gap is most pronounced for underrepresented scripts. These findings motivate moving beyond global frequency optimization as the main design criterion for multilingual tokenizers.

### 2.2 Parity-aware Byte-Pair Encoding

Parity-aware BPE (PA BPE; Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)) modifies Byte-level BPE by optimizing the worst-case compression rate across languages. Each merge iteration selects the pair that most improves the worst-performing language, trading marginal global efficiency for tokenization equity.

The approach requires minimal implementation changes to existing BPE pipelines\. On a 30\-language unbalanced dataset, it achieves a lower Gini coefficient of 0\.011 versus 0\.064 for Byte\-level BPE, while remaining competitive on compression and outperforming or matching Byte\-level BPE baselines across 13 multilingual benchmarks\.
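A toy sketch of a single parity-aware merge step follows. It is a simplification written for illustration, not the authors' implementation: it assumes the worst-performing language is the one that needs the most tokens for the same parallel content, and that the selected merge is the most frequent adjacent pair in that language. The function names and the two placeholder strings (standing in for parallel sentences) are hypothetical.

```python
from collections import Counter

def to_bytes(text):
    """Split text into a list of single-byte symbols (UTF-8)."""
    return [bytes([b]) for b in text.encode("utf-8")]

def pair_counts(sequences):
    """Count adjacent symbol pairs over a list of symbol sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts

def apply_merge(seq, pair):
    """Replace each occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def parity_aware_step(corpora):
    """Run one merge step driven by the currently worst-off language.

    Assumption: `corpora` maps language -> parallel sentences, so raw token
    counts are comparable across languages.
    """
    worst = max(corpora, key=lambda lang: sum(len(s) for s in corpora[lang]))
    counts = pair_counts(corpora[worst])
    if not counts:
        return worst, None
    pair = counts.most_common(1)[0][0]
    # The learned merge is shared, so it is applied to every language.
    for lang in corpora:
        corpora[lang] = [apply_merge(s, pair) for s in corpora[lang]]
    return worst, pair

# Placeholder strings standing in for one pair of parallel sentences.
corpora = {
    "en": [to_bytes("the cat")],
    "th": [to_bytes("\u0e01\u0e32\u0e01\u0e32")],
}
worst, pair = parity_aware_step(corpora)
print(worst, pair, {lang: len(seqs[0]) for lang, seqs in corpora.items()})
```

In standard BPE the pair would instead be chosen from counts pooled over all languages, which is why frequent Latin-script pairs tend to be merged first.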

### 2.3 Morphology-Driven Byte Encoding

MYTE (Limisiewicz et al., [2024](https://arxiv.org/html/2606.15044#bib.bib13)) replaces UTF-8’s character-based convention with morpheme-based byte codes, as morphemes exhibit more consistent sequence lengths than characters across languages. It learns a per-language morpheme inventory with Morfessor 2.0 (Smit et al., [2014](https://arxiv.org/html/2606.15044#bib.bib30)) to achieve balanced morphological coverage, and assigns shorter byte sequences to linguistically meaningful units.

MYTE produces shorter encodings than UTF-8 for all 99 languages tested, with gains ranging from 1% for Vietnamese and Chinese to nearly 70% for Burmese. Its worst-case tokenizer parity relative to English is 1.7, versus 3.5 for UTF-8. MyT5, a MYTE-encoded variant of ByT5 (Xue et al., [2022](https://arxiv.org/html/2606.15044#bib.bib35)), shows reduced cross-language perplexity disparity compared to its byte-level counterpart, and achieves 75.3 F1 on XTREME-UP (Ruder et al., [2023](https://arxiv.org/html/2606.15044#bib.bib26)) question answering versus 73.2 for ByT5.

### 2.4 Byte Latent Transformer

BLT (Pagnoni et al., [2025](https://arxiv.org/html/2606.15044#bib.bib20)) eliminates explicit tokenization entirely and comprises three modules: a lightweight local encoder producing patches, a large latent transformer processing them, and a lightweight local decoder reconstructing bytes. An entropy model drives patch segmentation, allocating computation proportional to data complexity.
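To make entropy-driven patching concrete, here is a minimal sketch under a simplifying assumption: a new patch starts whenever the predicted next-byte entropy exceeds a fixed global threshold. The entropy values below are invented for illustration; in BLT they would come from a small byte-level language model, which is not reproduced here. The function name and threshold are hypothetical.

```python
def entropy_patches(data, entropies, threshold):
    """Split `data` into patches, opening a new patch at high-entropy bytes.

    Assumption: `entropies[i]` is the model's uncertainty about byte i given
    the preceding bytes. Byte 0 always opens the first patch.
    """
    if len(data) != len(entropies):
        raise ValueError("need exactly one entropy value per byte")
    if not data:
        return []
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

data = b"hello world"
# Invented entropies: high at word starts, low inside words.
entropies = [3.0, 0.5, 0.4, 0.3, 0.2, 2.5, 2.8, 0.6, 0.4, 0.3, 0.2]
patches = entropy_patches(data, entropies, threshold=2.0)
avg_patch_size = len(data) / len(patches)
print(patches, round(avg_patch_size, 2))
```

Under this scheme, raising the threshold yields fewer and longer patches, so the threshold acts as the knob for average patch size.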

BLT enables a 50% reduction in inference FLOPs relative to Llama 3 (Grattafiori et al., [2024](https://arxiv.org/html/2606.15044#bib.bib1)) with its original tokenizer, without sacrificing downstream task performance. By avoiding a static vocabulary, BLT sidesteps the multilingual inequity that arises when high-resource language tokens dominate, and it outperforms Llama 3 by 2 BLEU points (Papineni et al., [2002](https://arxiv.org/html/2606.15044#bib.bib21)) on translation into English.

## 3 Methods

We compare the three tokenizer families discussed above to a baseline Byte\-level BPE tokenizer\. We train all tokenizers on the same dataset to evaluate their efficiency and cross\-lingual equity\. We then train language models from scratch using these tokenizers and evaluate their downstream task performance\. For fairness and reproducibility, data sizes are reported in number of sentences and bytes, rather than tokens\.

### 3.1 Training Data

For tokenizer training, we sample a total of 1 million sentences (3.5 GB) across eleven SEA languages from multilingual C4 (mC4) (Xue et al., [2021](https://arxiv.org/html/2606.15044#bib.bib36)). Sampling is performed randomly without replacement following the language proportions in mC4 to approximate realistic multilingual data distribution. The resulting per-language sentence counts are detailed in Appendix [A.1](https://arxiv.org/html/2606.15044#A1.SS1).

For language model training, we adopt the same training dataset as Foroutan et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib17)) and sample 100 million sentences (203 GB) from FineWeb2 (Penedo et al., [2025](https://arxiv.org/html/2606.15044#bib.bib9)). This dataset size is comparable to what Foroutan et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib17)) and Limisiewicz et al. ([2024](https://arxiv.org/html/2606.15044#bib.bib13)) used to train their language models. FineWeb2 is a multilingual web corpus with quality filtering already applied, and we did not apply further preprocessing before training. Language proportions are controlled using temperature sampling with $\tau = 1.21$ to boost the representation of low-resource languages (Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)). The details are provided in Appendix [A.2](https://arxiv.org/html/2606.15044#A1.SS2).
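The effect of temperature sampling can be sketched as follows. The text gives only $\tau = 1.21$, so the exponent convention used here, $q_i \propto p_i^{1/\tau}$ with $p_i$ the natural share of language $i$, is an assumption (a common convention in which $\tau > 1$ flattens the distribution). The sentence counts are hypothetical.

```python
def temperature_sampling_probs(counts, tau):
    """Return sampling probabilities q_i proportional to p_i ** (1 / tau).

    Assumed convention: tau = 1 reproduces the natural proportions and
    tau > 1 shifts probability mass toward smaller languages.
    """
    total = sum(counts.values())
    weights = {lang: (n / total) ** (1.0 / tau) for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Hypothetical sentence counts, not the study's data.
counts = {"en": 900, "th": 90, "lo": 10}
probs = temperature_sampling_probs(counts, tau=1.21)
for lang, q in probs.items():
    print(f"{lang}: natural {counts[lang] / 1000:.3f} -> sampled {q:.3f}")
```

With these toy counts, the smallest language's share roughly doubles while the largest language's share shrinks by a few points.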

Vocabulary sizes are controlled where possible to enable a fair comparison of the four tokenizers. MYTE was designed to have 4,096 morphemes per language to avoid over-segmentation. Thus, we train tokenizers at three scales: 4,096, 8,192, and 12,288 tokens per language, across all eleven SEA languages. For MYTE, this translates to total morpheme inventories of 45k, 90k, and 135k morphemes. The vocabulary sizes of Byte-level BPE and Parity-aware BPE are matched to MYTE’s total morpheme counts at each scale.

BLT’s patch-based representation is not directly comparable since it does not learn a fixed vocabulary. Following the approach of Pagnoni et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib20)), we configure BLT’s entropy model to yield average patch sizes of 4.5, 6, and 8 bytes per patch.

We use tokenizers with a vocabulary size of 90k to train language models, placing them close to the 100k–128k vocabulary size of most LLM tokenizers (Wegmann et al., [2025](https://arxiv.org/html/2606.15044#bib.bib34)). For BLT, we adopt the entropy model with an average patch size of 4.5 bytes, following the setup of Pagnoni et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib20)). Note that BLT is not a tokenizer in the traditional sense, but is referred to as one here for ease of comparison.

### 3.2 Implementation Details

Training of tokenizers and tokenization of language model training data for MYTE and BPE\-based algorithms were performed on a single AMD EPYC 9554P CPU \(128 threads\)\. For BLT, the entropy\-based tokenizer was trained on 4× NVIDIA H100 GPUs, and the language model training dataset was tokenized on 8× NVIDIA H200 GPUs\. Statistics of the language model training dataset after tokenization are reported in Table[1](https://arxiv.org/html/2606.15044#S3.T1)\.

| Tokenizer (size) | Duration (hour) | # Tokens (billion) | File size (GB) |
|---|---|---|---|
| BLT (4.5) | 33 | 42 | 204 |
| MYTE (90k) | 50 | 269 | 538 |
| PA BPE (90k) | 3 | 82 | 329 |
| BPE (90k) | 3 | 72 | 288 |

Table 1: Statistics of language model training dataset after tokenization by the four tokenizers. Legend: size = patch size for BLT, morpheme inventory size for MYTE, vocabulary size for all other models; File Size = Size of dataset files after tokenization; PA BPE = Parity-aware BPE; BPE = Byte-level BPE.

Language model training is carried out on 4–8× NVIDIA H100/H200 GPUs. To enable a fair comparison of computational cost, training durations are converted to an 8× NVIDIA H200 equivalent, as reported in Table [2](https://arxiv.org/html/2606.15044#S3.T2). MYTE incurs the highest training cost at 300 normalized hours due to its substantially larger token count (269B tokens), while Byte-level BPE is the most efficient at 68 hours (72B tokens). Additionally, we trained and compared all language models at an equal token count of 38B tokens as measured by their respective tokenizers. These experiments yielded the same conclusions as the models trained on the same dataset, so we omit them for the sake of brevity.

| Model (size) | Duration (hour) | # Tokens (billion) |
|---|---|---|
| BLT (4.5) | 160 | 42 |
| MYTE (90k) | 300 | 269 |
| PA BPE (90k) | 87 | 82 |
| BPE (90k) | 68 | 72 |

Table 2: Statistics of language model training.

### 3.3 Evaluation Metrics

#### 3.3.1 Intrinsic Metrics

Quantifying tokenizer efficiency and cross\-lingual fairness requires metrics that are agnostic to both language and model architecture\. We identify three such metrics from recent literature and provide brief descriptions below\. Detailed definitions and formulae can be found in Appendix[B](https://arxiv.org/html/2606.15044#A2)\.

Tokenizer parity measures the ratio of the number of tokens per sentence in a given language relative to English (Petrov et al., [2023](https://arxiv.org/html/2606.15044#bib.bib3)). A *tokenizer parity close to 1* indicates that the tokenizer imposes roughly equal computational cost across the given language and English.

Gini coefficient adapts the income inequality measure to the domain of tokenization fairness (Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)). It quantifies the distribution of per-language tokenization costs, with values ranging from 0 (perfect equality) to 1 (maximal inequality). A *lower Gini coefficient* reflects a more equitable tokenizer.

Compression rate measures how efficiently a tokenizer compresses text (Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)). A *higher compression rate* indicates that the tokenizer is more efficient and produces fewer tokens for the same amount of text.
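The three metrics can be sketched in a few lines. The exact formulae are deferred to Appendix B and are not reproduced here, so the definitions below are assumptions: parity as a ratio of token counts on parallel text, the Gini coefficient in its standard mean-absolute-difference form, and compression rate as text units (for example, sentences) per token. All token counts are hypothetical.

```python
def tokenizer_parity(tokens_lang, tokens_en):
    """Tokens needed for a language divided by tokens needed for English,
    measured on the same parallel sentences."""
    return tokens_lang / tokens_en

def gini(costs):
    """Standard Gini coefficient over per-language costs (0 = equal)."""
    n = len(costs)
    mean = sum(costs) / n
    abs_diff = sum(abs(a - b) for a in costs for b in costs)
    return abs_diff / (2 * n * n * mean)

def compression_rate(num_units, num_tokens):
    """Assumed definition: text units per token (higher = fewer tokens)."""
    return num_units / num_tokens

# Hypothetical token counts for the same set of parallel sentences.
token_counts = {"en": 30000, "id": 36000, "th": 45000}
parities = {
    lang: tokenizer_parity(n, token_counts["en"])
    for lang, n in token_counts.items()
}
g = gini(list(token_counts.values()))
print(parities, round(g, 4))
```

With these toy counts, Thai has a parity of 1.5, meaning a Thai sentence costs 1.5 times as many tokens as its English counterpart.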

#### 3.3.2 Extrinsic Metrics

We evaluate trained language models on English and multilingual classification benchmarks using Language Model Evaluation Harness (Biderman et al., [2024](https://arxiv.org/html/2606.15044#bib.bib5)) with zero-shot prompting. Details of the benchmarks can be found in Appendix [C](https://arxiv.org/html/2606.15044#A3). For machine translation, we evaluate fine-tuned models with five-shot prompts drawn from the training dataset of the multi-way parallel FLORES+ corpus (Costa-jussà et al., [2024](https://arxiv.org/html/2606.15044#bib.bib18)), following the setup of Limisiewicz et al. ([2024](https://arxiv.org/html/2606.15044#bib.bib13)).

To assess English language understanding, models are evaluated on three English classification benchmarks used by Pagnoni et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib20)): PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.15044#bib.bib37)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.15044#bib.bib38)), and Arc-C (Clark et al., [2018](https://arxiv.org/html/2606.15044#bib.bib22)). These benchmarks test commonsense reasoning, sentence completion, and science question answering respectively.

Cross-lingual generalization is assessed through three multilingual classification benchmarks used by Foroutan et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib17)). XNLI (Conneau et al., [2018](https://arxiv.org/html/2606.15044#bib.bib7)) evaluates natural language inference across multiple languages, XCOPA (Ponti et al., [2020](https://arxiv.org/html/2606.15044#bib.bib23)) tests causal commonsense reasoning in a multilingual setting, and XStoryCloze (Lin et al., [2022](https://arxiv.org/html/2606.15044#bib.bib14)) assesses story completion across languages.

Machine translation is assessed after fine-tuning via continual pre-training to ensure that results reflect the task-adapted performance of the models. Fine-tuning details and the resulting per-language sentence counts of the fine-tuning dataset are provided in Appendix [A.3](https://arxiv.org/html/2606.15044#A1.SS3). BLEU (Papineni et al., [2002](https://arxiv.org/html/2606.15044#bib.bib21)) and chrF (Popović, [2015](https://arxiv.org/html/2606.15044#bib.bib24)) scores are computed in both EN→XX and XX→EN directions for all ten non-English SEA languages. These two metrics range from 0 to 100 and higher scores indicate better translation quality.

## 4 Experiments

### 4.1 Intrinsic Evaluation

We evaluate the trained tokenizers on the FLORES+ devtest set, which consists of 1,012 aligned sentences across all eleven SEA languages. For Parity-aware BPE, we train the base variant with the FLORES+ training dataset as the development corpus, following the setup of Foroutan et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib17)). It is the only tokenizer among those evaluated that requires parallel data during training.

### 4.2 Extrinsic Evaluation

#### 4.2.1 Language Model

The base architecture of our language model is OLMo-2-1B (Walsh et al., [2025](https://arxiv.org/html/2606.15044#bib.bib8)), a decoder-only transformer comprising 16 layers, a hidden dimension of 2,048, and approximately 1.5 billion parameters. We train all models from scratch so that differences in downstream task performance can be largely attributed to tokenizer choice.

#### 4.2.2 Training Configuration

Models are trained using the AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.15044#bib.bib11)) optimizer with a peak learning rate of $4.0 \times 10^{-4}$, weight decay of 0.1, $\beta_1 = 0.9$, $\beta_2 = 0.95$, and gradient clipping of 1.0. The learning rate follows a Warmup-Stable-Decay (WSD) schedule (Hu et al., [2024](https://arxiv.org/html/2606.15044#bib.bib28)): a linear warmup over the first 1% of training tokens, a stable phase over the next 89%, and a linear decay over the final 10%. The global batch size is 512 sequences with a maximum sequence length of 4,096 tokens, corresponding to approximately 2 million tokens per training step.
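The schedule can be written as a small function of the training step. This is a sketch under stated assumptions: the phases are expressed in steps rather than tokens (equivalent when every step consumes the same number of tokens), and the decay is assumed to end at zero, which the text does not specify. The function name and the `final_lr` parameter are hypothetical.

```python
def wsd_lr(step, total_steps, peak_lr=4.0e-4,
           warmup_frac=0.01, decay_frac=0.10, final_lr=0.0):
    """Warmup-Stable-Decay learning rate for a 0-indexed `step`.

    Assumptions: linear warmup from near zero to `peak_lr`, a flat phase,
    then linear decay toward `final_lr` over the last `decay_frac` of steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress

# Tokens per optimizer step implied by the stated batch configuration.
tokens_per_step = 512 * 4096
print(tokens_per_step)

# Learning rate at a few points of a hypothetical 1,000-step run.
for s in (0, 9, 500, 900, 950, 999):
    print(s, wsd_lr(s, total_steps=1000))
```

The product of batch size and sequence length is 2,097,152 tokens, which is the "approximately 2 million tokens per training step" quoted above.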

#### 4.2.3 Statistical Significance

Extrinsic metrics are assessed for statistical significance using paired bootstrap resampling with 1,000 iterations at $p < 0.05$ (Koehn, [2004](https://arxiv.org/html/2606.15044#bib.bib12)). This ensures that reported performance differences between models reflect meaningful systematic effects rather than sampling variation across examples.
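A minimal sketch of paired bootstrap resampling for accuracy-style metrics is given below. It assumes per-example correctness indicators (1 or 0) for two systems on the same test examples and estimates a one-sided p-value as the fraction of resamples in which system A fails to beat system B. The function name and the toy predictions are hypothetical, and corpus-level metrics such as BLEU would need the metric recomputed on each resample instead of a simple sum.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_iter=1000, seed=0):
    """Estimate how often system A fails to outscore system B.

    Both lists hold per-example scores for the SAME examples, so each
    resample draws example indices once and reuses them for both systems.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = sum(correct_a[i] for i in idx)
        score_b = sum(correct_b[i] for i in idx)
        if score_a > score_b:
            wins += 1
    return 1.0 - wins / n_iter

# Toy data: A is right on 80 of 100 examples, B on 60 of the same examples.
system_a = [1] * 80 + [0] * 20
system_b = [1] * 60 + [0] * 40
p_value = paired_bootstrap(system_a, system_b)
print(p_value)
```

A difference would be called significant at the stated level when the estimated p-value falls below 0.05.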

## 5 Results

### 5.1 Intrinsic Evaluation

| Tokenizer (size) | CR | Gini | TP |
|---|---|---|---|
| BLT (4.5) | 0.0127 | 0.212 | 2.87 |
| BLT (6) | 0.0145 | 0.219 | 2.96 |
| BLT (8) | 0.0161 | 0.227 | 3.06 |
| MYTE (45k) | 0.0085 | 0.085 | <u>1.23</u> |
| MYTE (90k) | 0.0089 | 0.086 | <u>1.23</u> |
| MYTE (135k) | 0.0089 | 0.095 | 1.26 |
| PA BPE (45k) | 0.0250 | **0.021** | **1.15** |
| PA BPE (90k) | 0.0272 | <u>0.028</u> | 1.24 |
| PA BPE (135k) | 0.0280 | 0.029 | 1.25 |
| BPE (45k) | 0.0257 | 0.243 | 1.93 |
| BPE (90k) | <u>0.0293</u> | 0.220 | 1.73 |
| BPE (135k) | **0.0314** | 0.203 | 1.61 |

Table 3: Intrinsic evaluation of tokenizers on identical training data. Compression rate and tokenizer parity values are macro-averaged across languages. The best result is in **bold** and the second-best is <u>underlined</u>. Legend: CR = Compression Rate, TP = Tokenizer Parity.

![Refer to caption](https://arxiv.org/html/2606.15044v1/fig1.png)

Figure 1: Efficiency-equity Pareto front of the evaluated tokenizers. Values beside markers indicate the patch size for BLT, morpheme inventory size for MYTE, and vocabulary size for BPE and PA BPE.

Table [3](https://arxiv.org/html/2606.15044#S5.T3) reveals that Parity-aware BPE achieves the lowest Gini coefficient across all vocabulary sizes. This equity gain does not come at the cost of tokenizer efficiency, as Parity-aware BPE achieves competitive compression rates relative to Byte-level BPE.

MYTE has a relatively low Gini coefficient and tokenizer parity, but its compression rate is the worst\. This indicates that morpheme\-based segmentation produces longer token sequences\. The trade\-off is consistent with the inflated token counts observed during language model training\.

BLT exhibits poor equity across all patch sizes despite operating at the byte level without a fixed vocabulary. Its tokenizer parity is the highest among all approaches, suggesting that entropy-driven patch segmentation provides no built-in mechanism to correct for corpus imbalances.

Figure [1](https://arxiv.org/html/2606.15044#S5.F1) situates each tokenizer family within a two-dimensional efficiency-equity space, where the ideal direction is toward a higher compression rate and lower Gini coefficient (bottom-right of the plot). Parity-aware BPE lies on the Pareto front of the efficiency-equity space across all vocabulary sizes, highlighting that cross-lingual fairness and compression efficiency are not at odds. Byte-level BPE at its largest vocabulary size of 135k also lies on the Pareto front, as it has the highest compression rate. On the other hand, both BLT and MYTE are Pareto-dominated.
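The Pareto front can be recomputed from the compression rate and Gini values reported in Table 3. The sketch below treats a configuration as dominated when another one is at least as good on both axes (higher compression rate, lower Gini) and strictly better on at least one. The helper name is hypothetical.

```python
# (compression rate, Gini coefficient) per configuration, from Table 3.
points = {
    "BLT (4.5)": (0.0127, 0.212),
    "BLT (6)": (0.0145, 0.219),
    "BLT (8)": (0.0161, 0.227),
    "MYTE (45k)": (0.0085, 0.085),
    "MYTE (90k)": (0.0089, 0.086),
    "MYTE (135k)": (0.0089, 0.095),
    "PA BPE (45k)": (0.0250, 0.021),
    "PA BPE (90k)": (0.0272, 0.028),
    "PA BPE (135k)": (0.0280, 0.029),
    "BPE (45k)": (0.0257, 0.243),
    "BPE (90k)": (0.0293, 0.220),
    "BPE (135k)": (0.0314, 0.203),
}

def pareto_front(configs):
    """Return names not dominated on (higher rate, lower Gini)."""
    front = []
    for name, (rate, gini) in configs.items():
        dominated = any(
            other_rate >= rate and other_gini <= gini
            and (other_rate > rate or other_gini < gini)
            for other, (other_rate, other_gini) in configs.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

front = pareto_front(points)
print(front)
```

On these values the front consists of the three Parity-aware BPE configurations plus Byte-level BPE at 135k, matching the description above.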

### 5.2 Extrinsic Evaluation

#### 5.2.1 English Classification Benchmarks

| Model (size) | PIQA (50.00) | HellaSwag (25.00) | Arc-C (25.00) |
|---|---|---|---|
| BLT (4.5) | 66.10 | 44.81 | 24.74 |
| MYTE (90k) | 67.14 | 44.47 | <u>27.39</u> |
| PA BPE (90k) | <u>71.55</u> | <u>53.22</u> | 26.19 |
| BPE (90k) | **72.31** | **54.42** | **28.41** |

Table 4: Accuracies on English classification benchmarks. Parenthesized values indicate the expected accuracy of a random classifier. The highest score is in **bold** and the second-highest is <u>underlined</u>.

Table [4](https://arxiv.org/html/2606.15044#S5.T4) shows that Byte-level BPE achieves the highest scores on all three English classification benchmarks. Its advantage is statistically significant across all comparisons, except over Parity-aware BPE on PIQA and MYTE on Arc-C, where its numerical lead does not reach significance. Parity-aware BPE is a strong runner-up, performing significantly better than BLT and MYTE on both PIQA and HellaSwag. For commonsense reasoning and sentence completion benchmarks, BPE-based tokenizers have a consistent advantage over alternative approaches in our evaluation setting.

#### 5.2.2 Multilingual Classification Benchmarks

| Model (size) | XNLI (33.33) | XCOPA (50.00) | XStoryCloze (50.00) |
|---|---|---|---|
| BLT (4.5) | 36.29 | 53.30 | 56.52 |
| MYTE (90k) | **42.49** | 54.93 | 50.55 |
| PA BPE (90k) | 40.43 | <u>58.70</u> | <u>56.70</u> |
| BPE (90k) | <u>40.56</u> | **61.03** | **57.18** |

Table 5: Averaged per-language accuracies across multilingual classification benchmarks. Parenthesized values indicate the expected accuracy of a random classifier. The highest score is in **bold** and the second-highest is <u>underlined</u>. Detailed results for each language are reported in Appendix [D.1](https://arxiv.org/html/2606.15044#A4.SS1).

Table [5](https://arxiv.org/html/2606.15044#S5.T5) reveals task-dependent performance across tokenizers, with no consistent winner across all three benchmarks. MYTE significantly outperforms all other tokenizers on XNLI, consistent with its morphological representations providing richer cross-lingual semantic signals for inference. Byte-level BPE achieves the highest scores on XCOPA and XStoryCloze, suggesting that its efficiency is best leveraged on tasks requiring causal and narrative reasoning.

#### 5.2.3 Machine Translation

| Model (size) | EN→XX | XX→EN |
|---|---|---|
| BLT (4.5) | 10.82 | 11.70 |
| MYTE (90k) | **14.77** | **13.81** |
| PA BPE (90k) | 11.36 | 12.19 |
| BPE (90k) | <u>13.39</u> | <u>12.78</u> |

Table 6: Averaged BLEU scores across ten SEA languages. The highest score is in **bold** and the second-highest is <u>underlined</u>.

Table [6](https://arxiv.org/html/2606.15044#S5.T6) aggregates the BLEU translation scores across ten SEA languages. We also provide the chrF scores in Appendix [D.2.2](https://arxiv.org/html/2606.15044#A4.SS2.SSS2). MYTE achieves the highest BLEU scores in both translation directions. A consistent directional asymmetry is observed across both metrics for MYTE, where it is systematically stronger in EN→XX translation than XX→EN. MYTE’s morpheme-level segmentation enables finer-grained generation of morphologically complex target word forms in SEA languages, an advantage that narrows when translating into English.

## 6 Analysis

### 6.1 Effect of Scaling Vocabulary Size

Increasing vocabulary size produces distinct behavior across tokenizers, as seen in Table [3](https://arxiv.org/html/2606.15044#S5.T3). Byte-level BPE improves consistently across both efficiency and cross-lingual fairness with a larger vocabulary size. In contrast, BLT gains in efficiency but sacrifices cross-lingual equity as its patch size grows.

MYTE is relatively insensitive to morpheme inventory size scaling within the evaluated range\. Compression rate and tokenizer parity remain nearly constant across all three morpheme inventory sizes, indicating that its morphological segmentation approach saturates at around 4,096 morphemes per language\.

Parity-aware BPE becomes more inequitable as vocabulary size increases. While it achieves the lowest Gini coefficient among all models, both its Gini coefficient and tokenizer parity worsen at larger vocabulary sizes, increasing to 0.029 and 1.25 respectively at a vocabulary size of 135k.

### 6.2 Fairness Regression in Parity-aware BPE

It seems counterintuitive that Parity\-aware BPE tokenizers are less equitable as vocabulary size increases\. To investigate this, we analyzed per\-language token counts produced by Parity\-aware BPE tokenizers of different vocabulary sizes\. We use the FLORES\+ training dataset \(rather than the test dataset\) because it serves as the development corpus during tokenizer training\. Examining this dataset would isolate the effect of vocabulary scaling and avoid confounding factors from unseen data such as differing vocabulary distributions\.

![Refer to caption](https://arxiv.org/html/2606.15044v1/fig2.png)

Figure 2: Per-language token counts on the FLORES+ training dataset by Parity-aware BPE tokenizers of varying vocabulary sizes.

Figure [2](https://arxiv.org/html/2606.15044#S6.F2) shows that increasing the vocabulary size reduces the tokens required for English sentences more than for the other SEA languages. As a result, the token count disparity between English and the other SEA languages widens by 36% as vocabulary size scales from 45k to 135k, and tokenizer parity increases (i.e., worsens). We observe that Parity-aware BPE only limits worst-case per-language tokenizer parity, so its fairness mechanism does not prevent English from acquiring more merges as vocabulary size increases.

### 6.3 Effect of Tokenizer Choice and Script Type

To investigate whether tokenizer parity outcomes are driven by the choice of tokenizer, the script type of the target language, or their interaction, we apply a two-way mixed ANOVA test (Meyers et al., [2009](https://arxiv.org/html/2606.15044#bib.bib16)). We prefer this to separate pairwise $t$-tests for each tokenizer–script combination, which would inflate the family-wise Type I error rate.

The between-subject factor is script type, categorized as Latin (Indonesian, Malay, Tagalog, Vietnamese) or Abugida (Burmese, Khmer, Lao, Tamil, Thai) according to Limisiewicz et al. ([2024](https://arxiv.org/html/2606.15044#bib.bib13)). Chinese is excluded from this comparison as it is the only CJK-script language. The within-subject factor is tokenizer choice, as measuring the same language across tokenizers introduces within-subject correlation. We assess the effect of these factors on per-language tokenizer parity, measure statistical significance at $\alpha = 0.05$, and report partial $\eta^{2}$ as the effect size measure.

| Tokenizer (size) | Latin script | Abugida script |
|---|---|---|
| BLT (4.5) | 2.36 | 3.38 |
| MYTE (90k) | **1.19** | <u>1.32</u> |
| PA BPE (90k) | <u>1.26</u> | **1.22** |
| BPE (90k) | 1.30 | 2.18 |

Table 7: Per-language tokenizer parity, macro-averaged by tokenizer and script type. Latin scripts exclude English. The lowest value is in **bold** and the second-lowest is <u>underlined</u>.

Tokenizer choice is the only statistically significant factor affecting tokenizer parity ($p < 0.001$, partial $\eta^{2} = 0.752$). The large effect size indicates that tokenizer choice explains the majority of variance in tokenizer parity, regardless of script type. At the same time, script type alone does not reach significance ($p = 0.090$). As shown in Table [7](https://arxiv.org/html/2606.15044#S6.T7), Parity-aware BPE and MYTE achieve near-uniform tokenizer parity across both script types. In contrast, Byte-level BPE imposes a 1.68× tokenizer parity penalty on Abugida scripts relative to Latin scripts. BLT also exhibits a high cross-script disparity (1.43×), consistent with its entropy model being undertrained on non-Latin scripts.
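The cross-script penalties quoted above follow from dividing the Abugida column of Table 7 by the Latin column. The short check below reproduces that arithmetic; the dictionary layout is simply a convenient encoding of the table.

```python
# (Latin parity, Abugida parity) per tokenizer, from Table 7.
parity_by_script = {
    "BLT (4.5)": (2.36, 3.38),
    "MYTE (90k)": (1.19, 1.32),
    "PA BPE (90k)": (1.26, 1.22),
    "BPE (90k)": (1.30, 2.18),
}

# Ratio > 1 means Abugida scripts need more tokens than Latin scripts.
cross_script_ratio = {
    name: round(abugida / latin, 2)
    for name, (latin, abugida) in parity_by_script.items()
}
for name, ratio in cross_script_ratio.items():
    print(f"{name}: {ratio}x")
```

Because tokenizer parity is itself measured relative to English, the Abugida column (for example, 2.18 for Byte-level BPE) is the cost multiple with respect to English, while the ratio computed here is the multiple with respect to Latin-script languages.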

Fundamentally, tokenizer parity directly determines inference cost\. A tokenizer parity of $k$ for a given language implies that, under per\-token API pricing, a user pays $k$ times as much as an English user to process the same semantic content\. Under Byte\-level BPE, Abugida\-script users therefore incur on average 2\.18× the cost of English users for semantically equivalent prompts, and 1\.68× the cost of Latin\-script SEA\-language users \(Table [7](https://arxiv.org/html/2606.15044#S6.T7)\)\. These results confirm that equitable tokenizer design, and not script similarity to English, is the main determinant of whether a tokenizer imposes uniform computational costs across languages\.
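
To make the link between parity and cost concrete, the sketch below prices the same prompt for different parity values. The price per 1,000 tokens and the 1,000-token English prompt are hypothetical placeholders; only the parity values come from Table 7.

```python
def prompt_cost(english_tokens: int, parity: float, price_per_1k_tokens: float) -> float:
    """Cost of a prompt whose English version uses `english_tokens` tokens."""
    tokens = english_tokens * parity  # expected token count in the target language
    return tokens / 1000 * price_per_1k_tokens


PRICE = 0.01  # hypothetical price per 1,000 tokens, in arbitrary currency units

english = prompt_cost(1000, 1.00, PRICE)
abugida_bpe = prompt_cost(1000, 2.18, PRICE)     # Byte-level BPE, Abugida (Table 7)
abugida_pa_bpe = prompt_cost(1000, 1.22, PRICE)  # Parity-aware BPE, Abugida (Table 7)
print(english, abugida_bpe, abugida_pa_bpe)
```

Because the price cancels out, the cost ratio to English equals the parity itself, whatever the actual API price is.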

## 7 Conclusion

We present the first systematic, dataset\-controlled comparison of BLT, MYTE, Parity\-aware BPE, and Byte\-level BPE across eleven SEA languages, evaluating tokenizer equity, compression efficiency, and downstream task performance under the same experimental conditions\. Our intrinsic evaluation demonstrates that cross\-lingual equity and tokenization efficiency are not fundamentally at odds\.

Among the equitable tokenizers analyzed, MYTE delivers the strongest semantic inference and machine translation performance through richer morphological representations, though at the cost of a higher computational budget and lower compression efficiency\. Despite its architectural novelty in eliminating a fixed tokenizer vocabulary, BLT underperforms on downstream tasks, plausibly because its entropy model receives insufficient exposure to low\-resource languages under a realistic multilingual data distribution\.

The appropriate choice of tokenizer is ultimately use\-case dependent\. We recommend Parity\-aware BPE as a responsible default for multilingual models targeting SEA languages, given its favorable position in the efficiency\-equity space and relatively strong downstream task performance\. However, the base variant of Parity\-aware BPE requires multi\-way parallel data, which may be scarce or unavailable in low\-resource settings \(Foroutan et al\., [2025](https://arxiv.org/html/2606.15044#bib.bib17)\)\. MYTE is preferred when morphology and translation are critical, and the computational budget permits the associated training overhead\.

Our ANOVA results establish that tokenizer choice is the primary factor affecting tokenizer parity\. The difference in inference costs between English and SEA\-language users can be addressed through the choice of the tokenizer\. Equitable tokenization has direct, quantifiable consequences for the 671 million speakers across Southeast Asia \(Lovenia et al\., [2024](https://arxiv.org/html/2606.15044#bib.bib15)\), for whom token\-priced APIs create unequal access costs\. How tokenization is carried out across languages shapes the economic accessibility of multilingual models for underrepresented language communities\.

## Limitations

First, all language models are trained at the 1\.5B\-parameter scale due to computational resource constraints\. We leave the investigation of larger model sizes to future work\. Next, BLT cannot be scaled along the same vocabulary dimension as the other tokenizers since it operates without a fixed vocabulary, which prevents a direct comparison at matched vocabulary sizes\. Finally, we evaluate only base pretrained models\. We believe this is sufficient, as supervised fine\-tuning or alignment training does not affect the tokenizer’s fairness or efficiency\.

We do not anticipate any immediate societal or individual harm arising from this work\. Nevertheless, we advise users to exercise caution, as our models have not been subjected to safety or value alignment procedures\.

## References

- A bit of a problem: measurement disparities in dataset sizes across languages\. In Proceedings of SIGUL, pp\. 1–9\. [Link](https://aclanthology.org/2024.sigul-1.1/)\.
- S\. Biderman, H\. Schoelkopf, L\. Sutawika, L\. Gao, J\. Tow, B\. Abbasi, A\. F\. Aji, P\. S\. Ammanamanchi, S\. Black, J\. Clive, A\. DiPofi, J\. Etxaniz, B\. Fattori, J\. Z\. Forde, C\. Foster, J\. Hsu, M\. Jaiswal, W\. Y\. Lee, H\. Li, C\. Lovering, N\. Muennighoff, E\. Pavlick, J\. Phang, A\. Skowron, S\. Tan, X\. Tang, K\. A\. Wang, G\. I\. Winata, F\. Yvon, and A\. Zou \(2024\) Lessons from the trenches on reproducible evaluation of language models\. arXiv preprint arXiv:2405\.14782\. [Link](https://arxiv.org/abs/2405.14782)\.
- Y\. Bisk, R\. Zellers, R\. LeBras, J\. Gao, and Y\. Choi \(2020\) PIQA: reasoning about physical commonsense in natural language\. In Proceedings of AAAI, pp\. 7432–7439\. [Link](https://aaai.org/ojs/index.php/AAAI/article/view/6239)\.
- K\. Bostrom and G\. Durrett \(2020\) Byte pair encoding is suboptimal for language model pretraining\. In Findings of EMNLP, pp\. 4617–4624\. [Link](https://aclanthology.org/2020.findings-emnlp.414)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\. [Link](https://arxiv.org/abs/1803.05457)\.
- A\. Conneau, R\. Rinott, G\. Lample, A\. Williams, S\. Bowman, H\. Schwenk, and V\. Stoyanov \(2018\) XNLI: evaluating cross\-lingual sentence representations\. In Proceedings of EMNLP, pp\. 2475–2485\. [Link](https://aclanthology.org/D18-1269)\.
- The Unicode Consortium \(2011\) The Unicode standard\. Technical Report Version 6\.0\.0, Unicode Consortium\. [Link](https://www.unicode.org/versions/Unicode6.0.0/)\.
- M\. R\. Costa\-jussà, J\. Cross, O\. Çelebi, M\. Elbayad, K\. Heafield, K\. Heffernan, E\. Kalbassi, J\. Lam, D\. Licht, J\. Maillard, A\. Sun, S\. Wang, G\. Wenzek, A\. Youngblood, B\. Akula, L\. Barrault, G\. M\. Gonzalez, P\. Hansanti, J\. Hoffman, S\. Jarrett, K\. R\. Sadagopan, D\. Rowe, S\. Spruit, C\. Tran, P\. Andrews, N\. F\. Ayan, S\. Bhosale, S\. Edunov, A\. Fan, C\. Gao, V\. Goswami, F\. Guzmán, P\. Koehn, A\. Mourachko, C\. Ropers, S\. Saleem, H\. Schwenk, and J\. Wang \(2024\) Scaling neural machine translation to 200 languages\. Nature, pp\. 841–846\. ISSN 1476\-4687\. [Link](https://doi.org/10.1038/s41586-024-07335-x)\.
- N\. Foroutan, C\. Meister, D\. Paul, J\. Niklaus, S\. Ahmadi, A\. Bosselut, and R\. Sennrich \(2025\) Parity\-aware Byte\-Pair Encoding: improving cross\-lingual fairness in tokenization\. arXiv preprint arXiv:2508\.04796\. [Link](https://arxiv.org/abs/2508.04796)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. [Link](https://arxiv.org/abs/2407.21783)\.
- S\. Hu, Y\. Tu, X\. Han, C\. He, G\. Cui, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, W\. Zhao, X\. Zhang, Z\. L\. Thai, K\. Zhang, C\. Wang, Y\. Yao, C\. Zhao, J\. Zhou, J\. Cai, Z\. Zhai, N\. Ding, C\. Jia, G\. Zeng, D\. Li, Z\. Liu, and M\. Sun \(2024\) MiniCPM: unveiling the potential of small language models with scalable training strategies\. arXiv preprint arXiv:2404\.06395\. [Link](https://arxiv.org/abs/2404.06395)\.
- S\. Issaka, E\. R\. Gonzalez, L\. Liu, E\. K\. Agyei, L\. Bandarkar, N\. Peng, D\. I\. Adelani, F\. Guzmán, and S\. Gabriel \(2026\) Translation as a scalable proxy for multilingual evaluation\. arXiv preprint arXiv:2601\.11778\. [Link](https://arxiv.org/abs/2601.11778)\.
- P\. Koehn \(2004\) Statistical significance tests for machine translation evaluation\. In Proceedings of EMNLP, pp\. 388–395\. [Link](https://aclanthology.org/W04-3250)\.
- T\. Limisiewicz, T\. Blevins, H\. Gonen, O\. Ahia, and L\. Zettlemoyer \(2024\) MYTE: morphology\-driven byte encoding for better and fairer multilingual language modeling\. In Proceedings of ACL, pp\. 15059–15076\. [Link](https://aclanthology.org/2024.acl-long.804/)\.
- X\. V\. Lin, T\. Mihaylov, M\. Artetxe, T\. Wang, S\. Chen, D\. Simig, M\. Ott, N\. Goyal, S\. Bhosale, J\. Du, R\. Pasunuru, S\. Shleifer, P\. S\. Koura, V\. Chaudhary, B\. O’Horo, J\. Wang, L\. Zettlemoyer, Z\. Kozareva, M\. Diab, V\. Stoyanov, and X\. Li \(2022\) Few\-shot learning with multilingual generative language models\. In Proceedings of EMNLP, pp\. 9019–9052\. [Link](https://aclanthology.org/2022.emnlp-main.616)\.
- I\. Loshchilov and F\. Hutter \(2019\) Decoupled weight decay regularization\. In Proceedings of ICLR\. [Link](https://openreview.net/forum?id=Bkg6RiCqY7)\.
- H\. Lovenia, R\. Mahendra, S\. M\. Akbar, L\. J\. V\. Miranda, J\. Santoso, E\. Aco, A\. Fadhilah, J\. Mansurov, J\. M\. Imperial, O\. P\. Kampman, J\. R\. A\. Moniz, M\. R\. S\. Habibi, F\. Hudi, R\. Montalan, R\. Ignatius, J\. A\. Lopo, W\. Nixon, B\. F\. Karlsson, J\. Jaya, R\. Diandaru, Y\. Gao, P\. Amadeus, B\. Wang, J\. C\. B\. Cruz, C\. Whitehouse, I\. H\. Parmonangan, M\. Khelli, W\. Zhang, L\. Susanto, R\. A\. Ryanda, S\. L\. Hermawan, D\. J\. Velasco, M\. D\. A\. Kautsar, W\. F\. Hendria, Y\. Moslem, N\. Flynn, M\. F\. Adilazuarda, H\. Li, J\. Lee, R\. Damanhuri, S\. Sun, M\. R\. Qorib, A\. Djanibekov, W\. Q\. Leong, Q\. V\. Do, N\. Muennighoff, T\. Pansuwan, I\. F\. Putra, Y\. Xu, T\. N\. Chia, A\. Purwarianti, S\. Ruder, W\. Tjhi, P\. Limkonchotiwat, A\. F\. Aji, S\. Keh, G\. I\. Winata, R\. Zhang, F\. Koto, Z\. Yong, and S\. Cahyawijaya \(2024\) SEACrowd: a multilingual multimodal data hub and benchmark suite for Southeast Asian languages\. In Proceedings of EMNLP, pp\. 5155–5203\. [Link](https://aclanthology.org/2024.emnlp-main.296/)\.
- L\. S\. Meyers, G\. Gamst, and A\. J\. Guarino \(2009\) Two\-way mixed ANOVA design\. In Data Analysis Using SAS Enterprise Guide, pp\. 253–266\. [Link](https://doi.org/10.1017/CBO9780511804786.027)\.
- T\. S\. Nguyen, M\. R\. Qorib, and H\. T\. Ng \(2026\) OpenSeal: good, fast, and cheap construction of an open\-source Southeast Asian LLM via parallel data\. arXiv preprint arXiv:2602\.02266\. [Link](https://arxiv.org/abs/2602.02266)\.
- OpenAI \(2025\) OpenAI/Tiktoken: tiktoken is a fast BPE tokeniser for use with OpenAI’s models\. [Link](https://github.com/OpenAI/Tiktoken)\.
- A\. Pagnoni, R\. Pasunuru, P\. Rodriguez, J\. Nguyen, B\. Muller, M\. Li, C\. Zhou, L\. Yu, J\. E\. Weston, L\. Zettlemoyer, G\. Ghosh, M\. Lewis, A\. Holtzman, and S\. Iyer \(2025\) Byte Latent Transformer: patches scale better than tokens\. In Proceedings of ACL, pp\. 9238–9258\. [Link](https://aclanthology.org/2025.acl-long.453/)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\) BLEU: a method for automatic evaluation of machine translation\. In Proceedings of ACL, pp\. 311–318\. [Link](https://aclanthology.org/P02-1040)\.
- G\. Penedo, H\. Kydlíček, V\. Sabolčec, B\. Messmer, N\. Foroutan, A\. H\. Kargaran, C\. Raffel, M\. Jaggi, L\. V\. Werra, and T\. Wolf \(2025\) FineWeb2: one pipeline to scale them all – adapting pre\-training data processing to every language\. arXiv preprint arXiv:2506\.20920\. [Link](https://arxiv.org/abs/2506.20920)\.
- A\. Petrov, E\. L\. Malfa, P\. H\. S\. Torr, and A\. Bibi \(2023\) Language model tokenizers introduce unfairness between languages\. In Proceedings of NeurIPS, pp\. 36963–36990\. [Link](http://papers.nips.cc/paper_files/paper/2023/hash/74bb24dca8334adce292883b4b651eda-Abstract-Conference.html)\.
- E\. M\. Ponti, G\. Glavaš, O\. Majewska, Q\. Liu, I\. Vulić, and A\. Korhonen \(2020\) XCOPA: a multilingual dataset for causal commonsense reasoning\. In Proceedings of EMNLP, pp\. 2362–2376\. [Link](https://aclanthology.org/2020.emnlp-main.185)\.
- M\. Popović \(2015\) chrF: character n\-gram F\-score for automatic MT evaluation\. In Proceedings of WMT, pp\. 392–395\. [Link](https://aclanthology.org/W15-3049)\.
- M\. R\. Qorib, J\. Li, and H\. T\. Ng \(2025\) Just go parallel: improving the multilingual capabilities of large language models\. In Proceedings of ACL, pp\. 33411–33424\. [Link](https://aclanthology.org/2025.acl-long.1602/)\.
- S\. Ruder, J\. Clark, A\. Gutkin, M\. Kale, M\. Ma, M\. Nicosia, S\. Rijhwani, P\. Riley, J\. Sarr, X\. Wang, J\. Wieting, N\. Gupta, A\. Katanova, C\. Kirov, D\. Dickinson, B\. Roark, B\. Samanta, C\. Tao, D\. Adelani, V\. Axelrod, I\. Caswell, C\. Cherry, D\. Garrette, R\. Ingle, M\. Johnson, D\. Panteleev, and P\. Talukdar \(2023\) XTREME\-UP: a user\-centric scarce\-data benchmark for under\-represented languages\. In Findings of EMNLP, pp\. 1856–1884\. [Link](https://aclanthology.org/2023.findings-emnlp.125)\.
- A\. Selvamurugan, R\. Dandekar, R\. Dandekar, and S\. Panat \(2025\) From bias to balance: how multilingual dataset composition affects tokenizer performance across languages\. In Workshop on LM4UC\. [Link](https://openreview.net/forum?id=P2k908rWSP)\.
- R\. Sennrich, B\. Haddow, and A\. Birch \(2016\) Neural machine translation of rare words with subword units\. In Proceedings of ACL, pp\. 1715–1725\. [Link](https://aclanthology.org/P16-1162)\.
- P\. Smit, S\. Virpioja, S\. Grönroos, and M\. Kurimo \(2014\) Morfessor 2\.0: toolkit for statistical morphological segmentation\. In Proceedings of EACL, pp\. 21–24\. [Link](https://aclanthology.org/E14-2006)\.
- S\. Tamang and D\. J\. Bora \(2024\) Evaluating tokenizer performance of large language models across official Indian languages\. arXiv preprint arXiv:2411\.12240\. [Link](https://arxiv.org/abs/2411.12240)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, D\. Bikel, L\. Blecher, C\. C\. Ferrer, M\. Chen, G\. Cucurull, D\. Esiobu, J\. Fernandes, J\. Fu, W\. Fu, B\. Fuller, C\. Gao, V\. Goswami, N\. Goyal, A\. Hartshorn, S\. Hosseini, R\. Hou, H\. Inan, M\. Kardas, V\. Kerkez, M\. Khabsa, I\. Kloumann, A\. Korenev, P\. S\. Koura, M\. Lachaux, T\. Lavril, J\. Lee, D\. Liskovich, Y\. Lu, Y\. Mao, X\. Martinet, T\. Mihaylov, P\. Mishra, I\. Molybog, Y\. Nie, A\. Poulton, J\. Reizenstein, R\. Rungta, K\. Saladi, A\. Schelten, R\. Silva, E\. M\. Smith, R\. Subramanian, X\. E\. Tan, B\. Tang, R\. Taylor, A\. Williams, J\. X\. Kuan, P\. Xu, Z\. Yan, I\. Zarov, Y\. Zhang, A\. Fan, M\. Kambadur, S\. Narang, A\. Rodriguez, R\. Stojnic, S\. Edunov, and T\. Scialom \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\. [Link](https://arxiv.org/abs/2307.09288)\.
- E\. P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan, N\. Lambert, D\. Schwenk, O\. Tafjord, T\. Anderson, D\. Atkinson, F\. Brahman, C\. Clark, P\. Dasigi, N\. Dziri, A\. Ettinger, M\. Guerquin, D\. Heineman, H\. Ivison, P\. W\. Koh, J\. Liu, S\. Malik, W\. Merrill, L\. J\. V\. Miranda, J\. Morrison, T\. Murray, C\. Nam, J\. Poznanski, V\. Pyatkin, A\. Rangapur, M\. Schmitz, S\. Skjonsberg, D\. Wadden, C\. Wilhelm, M\. Wilson, L\. Zettlemoyer, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2025\) 2 OLMo 2 furious\. In Proceedings of COLM\. [Link](https://openreview.net/forum?id=2ezugTT9kU)\.
- A\. Wegmann, D\. Nguyen, and D\. Jurgens \(2025\) Tokenization is sensitive to language variation\. In Findings of ACL, pp\. 10958–10983\. [Link](https://aclanthology.org/2025.findings-acl.572/)\.
- L\. Xue, A\. Barua, N\. Constant, R\. Al\-Rfou, S\. Narang, M\. Kale, A\. Roberts, and C\. Raffel \(2022\) ByT5: towards a token\-free future with pre\-trained byte\-to\-byte models\. Transactions of ACL 10, pp\. 291–306\. [Link](https://aclanthology.org/2022.tacl-1.17)\.
- L\. Xue, N\. Constant, A\. Roberts, M\. Kale, R\. Al\-Rfou, A\. Siddhant, A\. Barua, and C\. Raffel \(2021\) mT5: a massively multilingual pre\-trained text\-to\-text transformer\. In Proceedings of NAACL, pp\. 483–498\. [Link](https://aclanthology.org/2021.naacl-main.41)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) HellaSwag: can a machine really finish your sentence?\. In Proceedings of ACL, pp\. 4791–4800\. [Link](https://aclanthology.org/P19-1472)\.

## Appendix A Training Data Details

### A\.1 Tokenizer Training Data

| ISO Code | Language | # Sentences |
|---|---|---|
| en | English | 604,793 |
| id | Indonesian | 129,906 |
| th | Thai | 115,159 |
| vi | Vietnamese | 90,470 |
| zh | Chinese | 25,683 |
| ms | Malay | 21,872 |
| ta | Tamil | 5,799 |
| tl | Tagalog | 3,480 |
| my | Burmese | 1,349 |
| km | Khmer | 1,253 |
| lo | Lao | 236 |
| | Total | 1,000,000 |

Table 8: Per\-language sentence counts in the tokenizer training dataset\.

### A\.2 Language Model Training Data

We source pretraining data from the FineWeb2 corpus \(Penedo et al\., [2025](https://arxiv.org/html/2606.15044#bib.bib9)\), randomly sampling a subset of 100 million sentences \(203 GB\) spanning all eleven SEA languages\. To balance coverage between high\-resource and low\-resource languages, we control the language proportions via temperature sampling, following the approach of Foroutan et al\. \([2025](https://arxiv.org/html/2606.15044#bib.bib17)\)\.

##### Temperature Sampling

Sampling each language proportionally to its word count in FineWeb2 would overwhelmingly favor English at 94\.3%\. As such, we sample according to a temperature\-scaled probability:

$$
p(L) \propto |L|^{1/\tau}, \tag{1}
$$

where $p(L)$ is the probability of sampling text from language $L$ during pre\-training, $|L|$ is the number of words in that language in the corpus, and $\tau$ is a temperature parameter\. When $\tau = 1$, sampling is purely proportional to word frequency\. As $\tau$ increases, the distribution becomes increasingly uniform, thereby boosting the relative sampling probability of low\-resource languages\.

We configure $\tau = 1.21$ so that English constitutes 89\.7% of the resulting training sentences, matching the proportion of English data used in the Llama 2 pretraining dataset \(Touvron et al\., [2023](https://arxiv.org/html/2606.15044#bib.bib10)\)\. Table [9](https://arxiv.org/html/2606.15044#A1.T9) shows the raw word frequency of each language in FineWeb2, along with the adjusted frequency after temperature sampling is applied\. For each language $L$, we compute $|L|^{1/\tau}$ and normalize across all eleven SEA languages to obtain the final sampling proportion\. These proportions are then used to determine the number of sentences drawn from each language in our 100M\-sentence subset \(Table [10](https://arxiv.org/html/2606.15044#A1.T10)\)\.
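
The temperature-scaling step can be sketched in a few lines. This is an illustrative reimplementation rather than the authors' script: the function name is hypothetical, and the inputs are the rounded word counts (in billions) from Table 9, so the results agree with the reported proportions only up to rounding.

```python
# Word counts in billions, as reported in Table 9.
WORDS = {
    "English": 11500.0, "Chinese": 543.5, "Indonesian": 60.3, "Vietnamese": 50.9,
    "Thai": 24.7, "Malay": 5.6, "Tamil": 1.9, "Tagalog": 1.6,
    "Burmese": 0.9, "Khmer": 0.7, "Lao": 0.2,
}


def sampling_proportions(word_counts: dict, tau: float) -> dict:
    """Normalize |L|^(1/tau) over all languages (Equation 1)."""
    weights = {lang: count ** (1.0 / tau) for lang, count in word_counts.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


proportional = sampling_proportions(WORDS, tau=1.0)  # plain word-count sampling
tempered = sampling_proportions(WORDS, tau=1.21)     # setting used in the paper
print(f"English share: {proportional['English']:.1%} -> {tempered['English']:.1%}")
```

With these inputs the English share drops from about 94.3% to about 89.7%, while every low-resource language gains relative weight. Multiplying each proportion by 100 million gives approximate per-language sentence counts.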

| Language | Word frequency (billion), $\lvert L \rvert$ | Relative frequency, $\lvert L \rvert^{1/\tau}$ | Proportion |
|---|---|---|---|
| English | 11,500.0 | 2,269.6 | 89.68% |
| Chinese | 543.5 | 182.2 | 7.20% |
| Indonesian | 60.3 | 29.6 | 1.17% |
| Vietnamese | 50.9 | 25.7 | 1.02% |
| Thai | 24.7 | 14.1 | 0.56% |
| Malay | 5.6 | 4.2 | 0.17% |
| Tamil | 1.9 | 1.7 | 0.07% |
| Tagalog | 1.6 | 1.5 | 0.06% |
| Burmese | 0.9 | 0.9 | 0.03% |
| Khmer | 0.7 | 0.7 | 0.03% |
| Lao | 0.2 | 0.3 | 0.01% |
| Total | 12,190.3 | 2,530.5 | 100.00% |

Table 9: Word frequencies in FineWeb2 and relative frequencies after temperature sampling \($\tau = 1.21$\)\. Temperature sampling boosts the relative proportion of low\-resource SEA languages while keeping English at 89\.7%, matching Llama 2’s pretraining data distribution\.

| ISO Code | Language | # Sentences |
|---|---|---|
| en | English | 89,689,566 |
| zh | Chinese | 7,199,747 |
| id | Indonesian | 1,169,277 |
| vi | Vietnamese | 1,016,743 |
| th | Thai | 558,780 |
| ms | Malay | 165,279 |
| ta | Tamil | 68,252 |
| tl | Tagalog | 59,364 |
| my | Burmese | 34,699 |
| km | Khmer | 28,295 |
| lo | Lao | 9,998 |
| | Total | 100,000,000 |

Table 10: Per\-language sentence counts in the language model training dataset after temperature sampling\.

### A\.3 Machine Translation Fine\-tuning

Machine translation fine\-tuning is performed via continual pre\-training on English\-XX parallel sentences for one epoch\. The fine\-tuning dataset consists of up to 13 million parallel sentence pairs per language, which were randomly sampled without replacement from NLLB \(Costa\-jussà et al\., [2024](https://arxiv.org/html/2606.15044#bib.bib18)\) where possible, following the approach of Nguyen et al\. \([2026](https://arxiv.org/html/2606.15044#bib.bib32)\)\. The resulting per\-language sentence counts are shown in Table [11](https://arxiv.org/html/2606.15044#A1.T11)\.

| ISO Code | Language | # Sentences (million) |
|---|---|---|
| zh | Chinese | 13.0 |
| id | Indonesian | 13.0 |
| vi | Vietnamese | 13.0 |
| th | Thai | 13.0 |
| ms | Malay | 13.0 |
| ta | Tamil | 13.0 |
| tl | Tagalog | 13.0 |
| my | Burmese | 10.0 |
| km | Khmer | 5.8 |
| lo | Lao | 4.2 |
| | Total | 111.0 |

Table 11: Per\-language sentence counts in the fine\-tuning dataset\.

Each parallel sentence pair is formatted using the template by Qorib et al\. \([2025](https://arxiv.org/html/2606.15044#bib.bib25)\): `"{source language}: {source sentence} \n {target language}: {target sentence}"`\. To prevent the model from developing a bias toward a fixed source language, the language order of each pair was randomized independently with equal probability\. This means each example has an equal chance of being presented as English\-first or target language\-first\.
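
The formatting and order randomization described above might look like the sketch below. It makes two assumptions the text does not settle: that language names are written out in English (for example "Indonesian"), and that `\n` in the template denotes an actual newline character. The function name is hypothetical.

```python
import random


def format_pair(english: str, target_lang: str, target: str, rng: random.Random) -> str:
    """Format one English-XX pair, picking the language order with probability 0.5."""
    sides = [("English", english), (target_lang, target)]
    if rng.random() < 0.5:
        sides.reverse()  # present the target language first
    (lang_a, sent_a), (lang_b, sent_b) = sides
    # Assumption: "\n" in the template is a literal newline surrounded by spaces.
    return f"{lang_a}: {sent_a} \n {lang_b}: {sent_b}"


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
example = format_pair("Good morning.", "Indonesian", "Selamat pagi.", rng)
print(example)
```

Drawing the order independently for every pair means that, over a large dataset, roughly half of the examples start with English and half with the other language.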

## Appendix B Intrinsic Tokenizer Evaluation Metrics

### B.1 Cross-lingual Equity Metrics

Tokenizer parity measures the ratio of the average number of tokens per sentence in a given language relative to English (Petrov et al., [2023](https://arxiv.org/html/2606.15044#bib.bib3)). For a specific language $L$ with $k$ aligned sentences, we can compute its average number of tokens per sentence, $average\_tokens_{L}$:

$$average\_tokens_{L} = \frac{1}{k}\sum_{i=1}^{k}\text{\# tokens in sentence } i \tag{2}$$

The tokenizer parity for language $L$, $parity_{L}$, is defined as:

$$parity_{L} = \frac{average\_tokens_{L}}{average\_tokens_{English}} \tag{3}$$

The macro-average tokenizer parity for all $n$ non-English languages is computed as:

$$\frac{1}{n}\sum_{j=1}^{n} parity_{j} \tag{4}$$

This ratio measures whether tokenizers impose computational costs unequally across languages. A macro-average tokenizer parity *closer to 1* indicates a more equitable tokenizer across languages.
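The three parity formulas can be sketched in a few lines. The token counts below are hypothetical values chosen for illustration, and the function names and the `"en"` key are assumptions; in practice the counts would come from tokenizing an aligned parallel corpus.

```python
def average_tokens(token_counts):
    """Mean number of tokens per sentence for one language (Eq. 2)."""
    return sum(token_counts) / len(token_counts)


def tokenizer_parity(counts_lang, counts_english):
    """Ratio of a language's mean token count to English's (Eq. 3)."""
    return average_tokens(counts_lang) / average_tokens(counts_english)


def macro_average_parity(counts_by_lang, english_key="en"):
    """Mean parity over all non-English languages (Eq. 4)."""
    english = counts_by_lang[english_key]
    parities = [
        tokenizer_parity(counts, english)
        for lang, counts in counts_by_lang.items()
        if lang != english_key
    ]
    return sum(parities) / len(parities)


# Hypothetical token counts for three aligned sentences per language.
toy_counts = {"en": [10, 12, 8], "th": [20, 24, 16], "id": [12, 12, 12]}
macro_parity = macro_average_parity(toy_counts)
print(macro_parity)
```

In this toy setting Thai needs twice as many tokens as English (parity 2.0) and Indonesian 1.2 times as many, so the macro-average parity is 1.6.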

Gini coefficient assesses tokenization equity by treating token costs as a distribution (Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)). The token cost for a language, $c$, is defined as the average number of tokens per sentence for the language in the parallel corpus. For token costs $c_{1}\leq c_{2}\leq\dots\leq c_{n}$ across $n$ languages, the Gini coefficient is computed as:

$$\frac{1}{n}\left(n+1-2\,\frac{\sum_{i=1}^{n}(n+1-i)\,c_{i}}{\sum_{i=1}^{n}c_{i}}\right) \tag{5}$$

Values range from 0 to 1. A *lower Gini coefficient* (closer to 0) indicates a more equitable tokenizer across languages.
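A direct transcription of Equation 5 is sketched below. The function name and the example costs are assumptions chosen for illustration; the only requirement from the formula is that costs are sorted in ascending order before the weights $(n+1-i)$ are applied.

```python
def gini_coefficient(token_costs):
    """Gini coefficient of per-language token costs, following Eq. 5."""
    costs = sorted(token_costs)  # the formula assumes c_1 <= ... <= c_n
    n = len(costs)
    # Weight (n + 1 - i) for the i-th smallest cost, with i starting at 1.
    weighted = sum((n + 1 - i) * c for i, c in enumerate(costs, start=1))
    return (n + 1 - 2 * weighted / sum(costs)) / n


equal_costs = gini_coefficient([5, 5, 5, 5])      # perfectly equitable
skewed_costs = gini_coefficient([0, 0, 0, 10])    # one language bears all cost
print(equal_costs, skewed_costs)
```

Equal costs give 0, as expected. With $n$ languages, the largest value this formula can return is $(n-1)/n$, reached when a single language carries all of the cost, so the upper bound of 1 is approached only as $n$ grows.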

### B.2 Tokenizer Efficiency Metrics

Compression rate measures how efficiently a tokenizer compresses text. It is defined as the average of the inverse token count per sentence (Foroutan et al., [2025](https://arxiv.org/html/2606.15044#bib.bib17)). For a specific language $L$, its compression rate, $rate_{L}$, is computed as:

$$rate_{L} = \frac{1}{k}\sum_{i=1}^{k}\frac{1}{\text{\# tokens in sentence } i} \tag{6}$$

where $k$ is the number of aligned sentences for language $L$ in the parallel corpus. Essentially, we compute a language’s compression rate by evaluating the inverse of the number of tokens per sentence and then averaging it.

The macro-average compression rate for all $n$ languages is computed as:

$$\frac{1}{n}\sum_{j=1}^{n} rate_{j} \tag{7}$$

Utilizing a parallel corpus controls for semantic differences by comparing token counts over semantically equivalent content. A *higher macro-average compression rate* indicates a more efficient tokenizer across languages. This metric is informative when viewed alongside tokenizer parity, as a high overall compression rate can mask under-compression of individual low-resource languages.
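The compression-rate definitions can be sketched as follows. The token counts are hypothetical and the function names are assumptions. The example also shows that averaging inverse counts (Equation 6) is not the same as inverting the average count.

```python
def compression_rate(token_counts):
    """Mean of the inverse token count per sentence (Eq. 6)."""
    return sum(1.0 / count for count in token_counts) / len(token_counts)


def macro_average_rate(counts_by_lang):
    """Mean compression rate over all languages (Eq. 7)."""
    rates = [compression_rate(counts) for counts in counts_by_lang.values()]
    return sum(rates) / len(rates)


# Hypothetical token counts for two aligned sentences per language.
toy_counts = {"en": [10, 10], "lo": [40, 40]}
macro_rate = macro_average_rate(toy_counts)

# Mean of inverses vs. inverse of the mean for uneven sentence lengths.
mean_of_inverses = compression_rate([5, 20])   # (1/5 + 1/20) / 2
inverse_of_mean = 1.0 / (sum([5, 20]) / 2)     # 1 / 12.5
print(macro_rate, mean_of_inverses, inverse_of_mean)
```

Here English has a rate of 0.1 and Lao 0.025, giving a macro-average of 0.0625. The macro-average alone does not reveal that Lao sentences need four times as many tokens, which is why the metric is read alongside tokenizer parity.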

## Appendix C Extrinsic Metrics

### C.1 English Classification Benchmarks

Classification benchmarks evaluate a model’s ability to understand, analyze, and select the correct category from a set of options. The following English classification benchmarks are used by Pagnoni et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib20)) to evaluate a model’s commonsense reasoning and general world knowledge.

Physical Interaction: Question Answering (PIQA) (Bisk et al., [2020](https://arxiv.org/html/2606.15044#bib.bib37)) probes a model’s understanding of everyday physical interactions and how objects behave in the real world. Each example presents a goal and two solution candidates, with the model tasked to identify the more physically plausible option.

HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.15044#bib.bib38)) is a commonsense natural language inference benchmark where a model must select the most plausible continuation of a given scenario from four candidate endings. The dataset is constructed using adversarial filtering so that selecting the correct ending requires genuine contextual understanding rather than surface cues.

Arc-Challenge (Arc-C) (Clark et al., [2018](https://arxiv.org/html/2606.15044#bib.bib22)) evaluates a model’s scientific reasoning ability through multiple-choice questions drawn from grade-school science exams. The Challenge subset selects questions that simple retrieval-based and word co-occurrence methods fail to answer correctly, making it a reliable indicator of deeper reasoning capabilities.

### C.2 Multilingual Classification Benchmarks

The following multilingual classification benchmarks are used by Foroutan et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib17)) and collectively span several of our target languages. They enable a comprehensive evaluation of a model’s cross-lingual performance on SEA languages.

Cross-lingual Natural Language Inference (XNLI) (Conneau et al., [2018](https://arxiv.org/html/2606.15044#bib.bib7)) extends the MultiNLI dataset to 15 languages and serves as a standard benchmark for cross-lingual natural language understanding. Models must classify the logical relationship between each premise-hypothesis pair as one of three categories: entailment, contradiction, or neutral. This benchmark covers English, Chinese, Thai, and Vietnamese.

Cross-lingual Choice of Plausible Alternatives (XCOPA) (Ponti et al., [2020](https://arxiv.org/html/2606.15044#bib.bib23)) is a multilingual benchmark targeting causal commonsense reasoning. Given a premise, a model must identify either the most plausible cause or effect from two candidate sentences. XCOPA is evaluated in a zero-shot setting to assess cross-lingual transfer without fine-tuning. This benchmark covers English, Chinese, Indonesian, Tamil, Thai, and Vietnamese.

XStoryCloze (Lin et al., [2022](https://arxiv.org/html/2606.15044#bib.bib14)) requires a model to select the correct ending for a four-sentence narrative from two candidate conclusions in a multilingual setting. It evaluates cross-lingual narrative understanding and commonsense reasoning. This benchmark covers English, Burmese, Chinese, and Indonesian.

### C.3 Machine Translation

Machine translation is a natural benchmark for evaluating multilingual LLMs, as it tests a model’s ability to understand and generate text across languages. For SEA languages, translation quality serves as a proxy for how well a model has internalized low-resource linguistic structure (Issaka et al., [2026](https://arxiv.org/html/2606.15044#bib.bib29)).

Machine translation performance is measured by comparing the machine-generated output to human reference translations. Two complementary metrics are typically used: BLEU (Bilingual Evaluation Understudy) (Papineni et al., [2002](https://arxiv.org/html/2606.15044#bib.bib21)) and chrF (character-level F-score) (Popović, [2015](https://arxiv.org/html/2606.15044#bib.bib24)). Both metrics have scores ranging from 0 to 100, with *higher scores* indicating better translation quality. BLEU measures word-level n-gram precision against reference translations and was used by Pagnoni et al. ([2025](https://arxiv.org/html/2606.15044#bib.bib20)). chrF computes a character n-gram F-score, which makes it suitable for morphologically rich languages where word-level overlap may be sparse; it was used by Limisiewicz et al. ([2024](https://arxiv.org/html/2606.15044#bib.bib13)).
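To make the character n-gram F-score concrete, a simplified sketch is given below. It is not the scoring tool used for the reported results and will not reproduce them exactly: the function names, the n-gram order of 6, the recall weight $\beta = 2$, and the removal of spaces are assumptions based on commonly used chrF settings, which the source does not state.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta combines precision and recall, weighting recall by beta.
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("the cat sits", "the cat sat"))
```

Because matching happens at the character level, a hypothesis that gets the stem of a word right but the inflection wrong still earns partial credit, whereas word-level n-gram matching would count the whole word as a miss.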

The detailed BLEU and chrF translation scores are shown in Appendix [D.2](https://arxiv.org/html/2606.15044#A4.SS2). The original MYTE paper (Limisiewicz et al., [2024](https://arxiv.org/html/2606.15044#bib.bib13)) reports scores only for English-to-Vietnamese and English-to-Tamil translation among SEA languages, and our scores are higher than their reported scores.

## Appendix D Detailed Results

### D.1 Multilingual Classification Benchmarks

| Model (size) | en | zh | vi | th | AVG |
| --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 42.10 | 33.99 | 34.29 | 34.79 | 36.29 |
| MYTE (90k) | 50.00 | 33.99 | 44.99 | 40.98 | 42.49 |
| PA BPE (90k) | 47.94 | 33.91 | 43.07 | 36.81 | 40.43 |
| BPE (90k) | 49.30 | 33.51 | 43.09 | 36.35 | 40.56 |

Table 12: Per-language XNLI scores. Expected accuracy of a random classifier = 33.33.

| Model (size) | en | zh | id | vi | th | ta | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 64.20 | 52.40 | 52.60 | 48.60 | 52.60 | 49.40 | 53.30 |
| MYTE (90k) | 54.40 | 51.00 | 55.80 | 56.60 | 55.00 | 56.80 | 54.93 |
| PA BPE (90k) | 71.40 | 55.60 | 58.60 | 60.20 | 53.40 | 53.00 | 58.70 |
| BPE (90k) | 71.60 | 59.00 | 61.60 | 62.80 | 56.20 | 55.00 | 61.03 |

Table 13: Per-language XCOPA scores. Expected accuracy of a random classifier = 50.00.

| Model (size) | en | zh | id | my | AVG |
| --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 65.45 | 54.20 | 55.06 | 51.36 | 56.52 |
| MYTE (90k) | 52.95 | 49.83 | 50.89 | 48.51 | 50.55 |
| PA BPE (90k) | 65.25 | 55.06 | 55.92 | 50.56 | 56.70 |
| BPE (90k) | 65.78 | 54.80 | 57.64 | 50.50 | 57.18 |

Table 14: Per-language XStoryCloze scores. Expected accuracy of a random classifier = 50.00.

### D.2 Machine Translation

#### D.2.1 BLEU scores

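| Model (size) | zh | id | vi | th | ms | ta | tl | my | km | lo | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 9.18 | 27.01 | 19.30 | 7.24 | 24.98 | 3.65 | 11.04 | 1.92 | 2.31 | 1.53 | 10.82 |
| MYTE (90k) | 18.43 | 34.92 | 25.36 | 9.25 | 27.66 | 5.61 | 16.73 | 2.49 | 3.40 | 3.90 | 14.77 |
| PA BPE (90k) | 22.78 | 26.47 | 18.30 | 5.64 | 21.04 | 4.38 | 9.11 | 1.30 | 2.56 | 2.01 | 11.36 |
| BPE (90k) | 30.20 | 27.72 | 32.51 | 6.90 | 23.10 | 2.56 | 8.05 | 0.58 | 1.45 | 0.82 | 13.39 |

Table 15: Per-language BLEU scores (EN→XX).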

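| Model (size) | zh | id | vi | th | ms | ta | tl | my | km | lo | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 10.33 | 25.63 | 21.94 | 8.47 | 18.30 | 4.65 | 17.10 | 3.92 | 4.31 | 2.33 | 11.70 |
| MYTE (90k) | 16.65 | 30.96 | 18.71 | 13.70 | 23.90 | 3.76 | 21.23 | 3.87 | 2.40 | 2.90 | 13.81 |
| PA BPE (90k) | 14.75 | 26.06 | 22.82 | 12.17 | 17.29 | 5.35 | 16.73 | 2.15 | 2.10 | 2.47 | 12.19 |
| BPE (90k) | 14.42 | 29.75 | 23.24 | 11.26 | 19.20 | 5.62 | 15.05 | 2.39 | 3.13 | 3.77 | 12.78 |

Table 16: Per-language BLEU scores (XX→EN).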

#### D.2.2 chrF scores

| Model (size) | zh | id | vi | th | ms | ta | tl | my | km | lo | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 13.16 | 48.75 | 42.03 | 30.48 | 45.46 | 31.52 | 31.35 | 19.91 | 19.92 | 20.63 | 30.32 |
| MYTE (90k) | 44.13 | 59.82 | 54.22 | 40.37 | 47.27 | 26.22 | 42.02 | 22.46 | 26.29 | 25.89 | 38.87 |
| PA BPE (90k) | 15.29 | 57.87 | 46.46 | 24.16 | 51.92 | 32.62 | 44.76 | 20.00 | 17.82 | 19.58 | 33.05 |
| BPE (90k) | 27.86 | 67.47 | 54.80 | 30.81 | 62.05 | 30.07 | 51.56 | 14.89 | 15.87 | 13.80 | 36.92 |

Table 17: Per-language chrF scores (EN→XX).

| Model (size) | zh | id | vi | th | ms | ta | tl | my | km | lo | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BLT (4.5) | 31.26 | 54.78 | 43.12 | 34.10 | 44.98 | 23.80 | 36.71 | 19.33 | 19.54 | 18.05 | 32.57 |
| MYTE (90k) | 28.18 | 60.04 | 40.16 | 31.13 | 59.96 | 23.45 | 42.58 | 19.38 | 20.56 | 20.32 | 34.58 |
| PA BPE (90k) | 41.60 | 51.99 | 50.27 | 33.43 | 42.87 | 25.37 | 36.85 | 20.05 | 20.66 | 21.99 | 34.51 |
| BPE (90k) | 43.16 | 62.22 | 53.37 | 35.38 | 45.52 | 28.89 | 38.39 | 21.93 | 24.15 | 24.30 | 37.73 |

Table 18: Per-language chrF scores (XX→EN).