Neural Machine Translation for Low-Resource Tangkhul–English

Summary

Presents a neural machine translation system for the severely under-resourced Tangkhul–English language pair. The primary system, a fine-tuned ByT5-large, achieves BLEU 39.97, chrF++ 58.07, BERTScore F1 0.8104, and COMET 0.7302, well ahead of a contrastive fine-tuned mT5-small system.

arXiv:2606.25365v1

# Neural Machine Translation for Low-Resource Tangkhul–English
Source: [https://arxiv.org/html/2606.25365](https://arxiv.org/html/2606.25365)

Chormi Zimik Vashai, Independent Researcher, Kalamazoo, United States \(czimik94@gmail\.com\)

Agniva Maiti, KIIT University, Bhubaneswar, India \(2205964@kiit\.ac\.in\)

###### Abstract

We present a study on low\-resource machine translation for the Tangkhul–English \(nmf–en\) language pair\. Tangkhul is a severely under\-resourced Tibeto\-Burman language spoken primarily in Manipur, India, with virtually no prior natural language processing infrastructure\. We describe two systems: \(1\) a primary system based on ByT5\-large fine\-tuned on 38,336 Tangkhul–English parallel sentence pairs, and \(2\) a contrastive system based on mT5\-small fine\-tuned on the same corpus\. Our primary ByT5\-large system achieves a corpus BLEU score of 39\.97, chrF\+\+ of 58\.07, BERTScore F1 of 0\.8104, and COMET \(wmt22\-comet\-da\) of 0\.7302 on a held\-out test set of 3,856 sentences\. We further discuss the orthographic challenges specific to Tangkhul’s Latin\-script diacritics, the domain bias of our training corpus \(which comprises biblical text, stories, and conversational data\), and avenues for future improvement through data diversification and domain adaptation\.

*Keywords:* Low\-resource machine translation · Tangkhul · ByT5 · mT5 · Tibeto\-Burman languages · Neural machine translation

## 1 Introduction

Machine translation \(MT\) for low\-resource languages remains one of the most pressing open problems in natural language processing\. While neural machine translation \(NMT\) has achieved remarkable performance on high\-resource language pairs such as English–German and English–Chinese, the vast majority of the world’s roughly 7,000 languages remain effectively invisible to modern NLP systems due to the absence of large\-scale parallel corpora, pretrained representations, and standardised orthographies\.

Tangkhul \(ISO 639\-3: nmf\) is a Sino\-Tibetan language of the Tangkhulic branch, spoken by approximately 150,000–200,000 people \(Ethnologue, [2015](https://arxiv.org/html/2606.25365#bib.bib25); Lisam, [2011](https://arxiv.org/html/2606.25365#bib.bib28)\), predominantly in the Ukhrul district of Manipur, northeast India, with smaller communities in Myanmar \(Ethnologue: Languages of the World, [2016](https://arxiv.org/html/2606.25365#bib.bib32)\)\. The name Tangkhul is an exonym given by the neighbouring Meitei people, widely believed to derive from the Meitei words tāng \(‘scarce’\) and khūl \(‘village’\) \(Sanyu, [1996](https://arxiv.org/html/2606.25365#bib.bib26)\), or alternatively from Than\-khul \(‘Than village’\) \(Shimray, [2001](https://arxiv.org/html/2606.25365#bib.bib27)\)\. The language was first committed to writing in 1897 when the missionary William Pettigrew compiled the Tangkhul Primer \(S, [2023](https://arxiv.org/html/2606.25365#bib.bib30); Pettigrew, [1897](https://arxiv.org/html/2606.25365#bib.bib36)\)\. Like many Naga languages, Tangkhul is characterised by SOV \(subject–object–verb\) constituent order, agglutinative morphology, and a system of grammatical tone, though tone is not marked in the standard orthography \(Ahum, [1997](https://arxiv.org/html/2606.25365#bib.bib33)\)\. The language is written in a Latin\-based script that incorporates two phonologically distinctive diacritics: the macron\-above \(ā, Unicode U\+0101\) to mark a long vowel, and the combining macron\-below \(a̱, Unicode U\+0331\) to mark a distinct vowel quality\. Both characters lie within Unicode’s Basic Multilingual Plane but outside the printable ASCII range, and they create tokenisation challenges for byte\-pair encoding \(BPE\) and SentencePiece vocabularies trained on predominantly European corpora\.

Prior to this work, the Tangkhul language had essentially zero dedicated NLP resources: no publicly available parallel corpora, no trained translation models, no morphological analysers, and no pretrained language models\. Our contribution addresses this gap by \(i\) assembling what is, to the best of our knowledge, the first publicly available large\-scale Tangkhul–English parallel corpus, \(ii\) fine\-tuning two state\-of\-the\-art multilingual sequence\-to\-sequence models on this data, \(iii\) reporting a comprehensive evaluation using BLEU, chrF\+\+, BERTScore, and COMET, and \(iv\) releasing our best model to the research community via the Hugging Face Hub\.

## 2 Related Work

### 2\.1 Linguistic Profile of Tangkhul

Tangkhul belongs to the Tangkhulic branch of the Sino\-Tibetan language family, encompassing several dialects such as Kabonglo and Lairamlo \(Devi, [2019](https://arxiv.org/html/2606.25365#bib.bib34); Chanu, [2019](https://arxiv.org/html/2606.25365#bib.bib35)\)\. It exhibits extensive agglutinative morphology, where a single verbal complex can encode tense, aspect, mood, agreement, and directionality through a sequence of bound affixes \(Ahum, [1997](https://arxiv.org/html/2606.25365#bib.bib33)\)\. This high degree of morphological synthesis poses a severe sparsity problem for traditional subword tokenisers \(like BPE or SentencePiece\), which rely on finding recurring text fragments in massive corpora\. Since Tangkhul lacks large\-scale unlabelled text corpora for pretraining, subword vocabularies learned from heavily skewed multilingual datasets \(such as the mC4 corpus used for mT5\) fail to capture these morphological boundaries, leading to arbitrary and fragmented token splits\. Furthermore, Tangkhul is a tonal language, though the Latin\-based orthography introduced in the late 19th century does not mark tone, relying instead on contextual disambiguation\. This orthographic ambiguity complicates the translation task, as the sequence\-to\-sequence model must infer semantic intent entirely from surrounding syntactic cues\.

### 2\.2 Low\-Resource Neural Machine Translation

Neural machine translation has seen dramatic advances since the introduction of the Transformer architecture \(Vaswani et al\., [2017](https://arxiv.org/html/2606.25365#bib.bib20)\)\. However, the gains have been highly unequal across resource levels\. For low\-resource pairs, several strategies have proven effective: transfer learning from multilingual pretrained models \(Zoph et al\., [2016](https://arxiv.org/html/2606.25365#bib.bib24); Gu et al\., [2018](https://arxiv.org/html/2606.25365#bib.bib5)\), data augmentation via back\-translation \(Sennrich et al\., [2016](https://arxiv.org/html/2606.25365#bib.bib19)\), and unsupervised MT from monolingual data \(Lample et al\., [2018](https://arxiv.org/html/2606.25365#bib.bib6); Artetxe et al\., [2018](https://arxiv.org/html/2606.25365#bib.bib3)\)\. For the specific challenge of Indic and South/Southeast Asian low\-resource languages, the IndicTrans \(Ramesh et al\., [2022](https://arxiv.org/html/2606.25365#bib.bib16)\) and IndicTrans2 \(Gala et al\., [2023](https://arxiv.org/html/2606.25365#bib.bib4)\) systems have established strong multilingual baselines across 22 constitutionally scheduled Indian languages and a number of additional Indic languages\. However, Tangkhul is not included in these systems, reflecting the broader invisibility of Northeast Indian tribal languages in the NLP literature\. The WMT shared tasks have historically focused on European and East Asian language pairs\. In recent years, dedicated low\-resource tracks \(e\.g\., IndicMT at WMT 2023, 2024, 2025\) have begun to incorporate Indic languages, but Northeast Indian Tibeto\-Burman languages remain absent from most benchmarks\.

### 2\.3 Byte\-Level and Character\-Level Models

ByT5 \(Xue et al\., [2022](https://arxiv.org/html/2606.25365#bib.bib22)\) is a byte\-level variant of T5 \(Raffel et al\., [2020](https://arxiv.org/html/2606.25365#bib.bib14)\) that operates directly on raw UTF\-8 byte sequences, eliminating the need for a learned subword vocabulary and making it inherently robust to any written language, character set, or orthography\. ByT5 is particularly well\-suited to low\-resource and morphologically rich languages, where subword tokenisation may produce overly fragmented representations or fail to capture productive morphological processes\. For truly zero\-resource scenarios, ByT5’s tokenisation\-free approach means it can immediately process any text in any script without preprocessing\.

mT5 \(Xue et al\., [2021](https://arxiv.org/html/2606.25365#bib.bib21)\) is a multilingual variant of T5 pretrained on the mC4 corpus covering 101 languages\. It uses SentencePiece unigram language model tokenisation with a shared vocabulary of 250,112 tokens \(the base 250,100 tokens plus 12 added special tokens in the Hugging Face configuration\)\. While mT5 includes coverage for the Latin script and common diacritics, its pretraining data contains no Tangkhul text, making it a zero\-shot baseline without fine\-tuning\. Both models have been applied to low\-resource MT in previous work\. Adelani et al\. \([2022](https://arxiv.org/html/2606.25365#bib.bib1)\) demonstrated that mT5 and ByT5 achieve competitive performance on African low\-resource languages\. Edman et al\. \([2024](https://arxiv.org/html/2606.25365#bib.bib37)\) showed that character\-level and byte\-level models offer distinct advantages in translation quality over subword models like mT5, particularly for rare words and morphologically complex settings\.

### 2\.4 Biblical Corpora in Low\-Resource NLP

Parallel Bible corpora have long served as a multilingual resource of last resort for low\-resource languages\. The Parallel Bible Corpus \(Mayer and Cysouw, [2014](https://arxiv.org/html/2606.25365#bib.bib9)\) covers over 1,000 languages\. Similarly, Agić and Vulić \([2019](https://arxiv.org/html/2606.25365#bib.bib2)\) compiled the JW300 parallel corpus from Jehovah’s Witnesses publications, covering over 300 low\-resource languages\. The key limitation of such religious\-derived training data is domain specificity: the vocabulary, register, and syntactic constructions differ substantially from everyday conversational or news language\. We acknowledge this limitation explicitly in our analysis \(Section [6\.5](https://arxiv.org/html/2606.25365#S6.SS5)\)\.

## 3 Dataset

### 3\.1 Data Collection and Structure

Our corpus consists of aligned Tangkhul–English sentence pairs derived primarily from a parallel Bible translation, supplemented with stories and conversational data \(from row 31,095 onwards\)\. The raw data was compiled and cleaned by the project team and stored in a spreadsheet \(XLSX\) with the following columns:

- verse\_text\_t: The Tangkhul source sentence \(one Bible verse per row\)
- verse\_text\_e: The corresponding English translation

After loading the spreadsheet, we applied the following preprocessing pipeline:

1. Whitespace normalisation: All runs of whitespace characters replaced with a single space\.
2. Tangkhul character filtering: Retained only printable ASCII characters \(U\+0020–U\+007E\), the long\-vowel macron ā \(U\+0101\), and the combining macron\-below \(a̱, U\+0331\)\. All other Unicode characters \(primarily punctuation artefacts and encoding errors\) were removed\.
3. English character filtering: Retained only printable ASCII characters\.
4. Deduplication: Duplicate source\-side sentences were removed\.
5. Empty\-row removal: Rows where either column was null or empty after cleaning were discarded\.
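The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' script: the function names (`clean`, `preprocess`) are hypothetical, rows are assumed to arrive as (Tangkhul, English) tuples already read from the spreadsheet, and the final `strip()` is an added convenience that the text does not mention.

```python
import re

# Characters kept on the Tangkhul side: printable ASCII, the precomposed
# long-vowel macron (U+0101), and the combining macron-below (U+0331).
TANGKHUL_DROP = re.compile("[^\u0020-\u007E\u0101\u0331]")
# Characters kept on the English side: printable ASCII only.
ENGLISH_DROP = re.compile("[^\u0020-\u007E]")
WHITESPACE = re.compile(r"\s+")


def clean(text, drop_pattern):
    """Collapse whitespace runs, then delete characters outside the allowed set."""
    if text is None:
        return ""
    text = WHITESPACE.sub(" ", str(text))
    # strip() is an assumption of this sketch, not a step listed in the paper.
    return drop_pattern.sub("", text).strip()


def preprocess(pairs):
    """Clean (tangkhul, english) pairs, dropping empty rows and duplicate sources."""
    seen, cleaned = set(), []
    for src, tgt in pairs:
        src, tgt = clean(src, TANGKHUL_DROP), clean(tgt, ENGLISH_DROP)
        if not src or not tgt:
            continue  # empty-row removal
        if src in seen:
            continue  # source-side deduplication
        seen.add(src)
        cleaned.append((src, tgt))
    return cleaned
```

Calling `preprocess(rows)` on the loaded rows returns the cleaned pairs in their original order, keeping the first occurrence of each source sentence.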

The final cleaned corpus contains 38,336 sentence pairs\. Table [1](https://arxiv.org/html/2606.25365#S3.T1) summarises corpus statistics\. Vocabulary sizes in Table [1](https://arxiv.org/html/2606.25365#S3.T1) were estimated using simple whitespace tokenisation and lowercasing\. The relatively large Tangkhul vocabulary \(roughly 18,200 types\) compared to English \(roughly 14,300 types\) reflects Tangkhul’s highly agglutinative morphology\.

Table 1: Corpus Statistics

\[Figure 1 plot: histogram with x\-axis Tokens per Sentence \(bins 1\-5, 6\-10, 11\-15, 16\-20, 21\-25, 26\+\), y\-axis Number of Sentences \(in units of 10,000\), and series Tangkhul and English\]

Figure 1: Distribution of token counts per sentence in the Tangkhul–English parallel corpus\.

### 3\.2 Dataset Split

We adopted a 90/5/5 train/validation/test split with a fixed random seed \(42\) using stratified random splitting to preserve the distribution of verse lengths across partitions\. For the full\-corpus evaluation using our inference pipeline, we evaluated on 10% of the full cleaned dataset \(3,856 sentences\) to provide a larger, more statistically stable estimate of model performance\.
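One way to realise a length\-stratified 90/5/5 split with seed 42 is sketched below. The bucketing rule is an assumption: the text says only that verse\-length distribution is preserved, so this sketch reuses the five\-token bins of Figure 1, and the names `length_bucket` and `stratified_split` are hypothetical.

```python
import random
from collections import defaultdict


def length_bucket(sentence, width=5, cap=26):
    """Map a sentence to a length bin: 1-5, 6-10, ..., 21-25, 26+ (assumed bins)."""
    n = max(len(sentence.split()), 1)
    return min((n - 1) // width, (cap - 1) // width)


def stratified_split(pairs, seed=42, val_frac=0.05, test_frac=0.05):
    """Split (source, target) pairs so each length bin is divided roughly 90/5/5."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pair in pairs:
        buckets[length_bucket(pair[0])].append(pair)
    train, val, test = [], [], []
    for key in sorted(buckets):  # fixed bin order keeps the split reproducible
        items = buckets[key]
        rng.shuffle(items)
        n_val = round(len(items) * val_frac)
        n_test = round(len(items) * test_frac)
        val.extend(items[:n_val])
        test.extend(items[n_val:n_val + n_test])
        train.extend(items[n_val + n_test:])
    return train, val, test
```

Because the generator is seeded and the bins are visited in sorted order, repeated calls on the same input return identical partitions.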

Table 2: Dataset Splits\. Primary test is 1,917 sentences; full\-corpus evaluation uses 3,856 sentences\.

### 3\.3 Orthographic Considerations

Tangkhul’s two non\-ASCII diacritics present non\-trivial tokenisation challenges:

- ByT5: Operates at the byte level, so ā \(2 bytes in UTF\-8\) and a̱ \(a 1\-byte base letter followed by the 2\-byte combining mark 0xCC 0xB1\) are naturally handled as short byte sequences\. No special preprocessing is required\.
- mT5/SentencePiece: The SentencePiece vocabulary trained on mC4 includes ā but lacks the combining macron\-below \(a̱\)\. Because mT5 uses byte\-fallback, it splits the unrecognized a̱ grapheme into its base character a and raw UTF\-8 byte tokens \(<0xCC\><0xB1\>\)\. This fragmentation separates the phonetic modifier from its base character and increases effective sequence lengths by approximately 10–15% for Tangkhul text\.

Table [3](https://arxiv.org/html/2606.25365#S3.T3) provides a concrete example of this phenomenon, illustrating how mT5’s SentencePiece tokenizer fragments a common Tangkhul word containing the a̱ diacritic compared to ByT5’s clean byte\-level representation\.

Table 3: Tokenisation of the Tangkhul word tara̱ \(‘water’\)\. The combining macron\-below falls back to byte tokens in mT5, separating it from the base character a\. ByT5 encodes the entire sequence natively as bytes\.
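The byte sequences behind this example can be checked with the Python standard library alone. The helper name `utf8_hex` is hypothetical; the code points are those given in the text.

```python
import unicodedata


def utf8_hex(text):
    """Return the UTF-8 encoding of text as a lowercase hex string."""
    return text.encode("utf-8").hex()


word = "tara\u0331"   # 'tara' followed by the combining macron-below (U+0331)
long_a = "\u0101"     # precomposed a-with-macron (U+0101)

print(utf8_hex(word))    # 74617261ccb1: four ASCII bytes, then 0xCC 0xB1
print(utf8_hex(long_a))  # c481: a single code point encoded as two bytes
print(unicodedata.name(word[-1]))  # COMBINING MACRON BELOW
```

The word is five code points but six bytes, and the trailing two bytes are exactly the `<0xCC><0xB1>` fallback tokens that the subword tokeniser emits.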

## 4 System Description

We developed two systems for this task\.

### 4\.1 Primary System: ByT5\-large \(Fine\-tuned\)

#### Architecture\.

ByT5\-large is an encoder\-decoder Transformer with approximately 1\.23 billion parameters\. It processes raw UTF\-8 byte sequences without any tokenisation step, with a fixed vocabulary of 259 tokens \(the 256 possible byte values, 0–255, plus three special tokens\)\.
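A byte\-level vocabulary of this shape can be mimicked in a few lines. The sketch below assumes the three special tokens occupy IDs 0 to 2 (padding, end\-of\-sequence, unknown) so that each byte value is shifted by 3, which is the layout used by the Hugging Face ByT5 tokenizer; the function names are hypothetical, and the real tokenizer additionally reserves sentinel IDs that are ignored here.

```python
# Assumed special-token layout: IDs 0-2 are reserved, bytes start at ID 3.
SPECIAL_TOKENS = {"<pad>": 0, "</s>": 1, "<unk>": 2}
OFFSET = len(SPECIAL_TOKENS)
VOCAB_SIZE = 256 + OFFSET  # 259


def encode(text):
    """Map text to byte-level IDs and append the end-of-sequence ID."""
    return [b + OFFSET for b in text.encode("utf-8")] + [SPECIAL_TOKENS["</s>"]]


def decode(ids):
    """Invert encode(), skipping special tokens and out-of-range IDs."""
    data = bytes(i - OFFSET for i in ids if OFFSET <= i < VOCAB_SIZE)
    return data.decode("utf-8", errors="ignore")
```

Under this mapping the six bytes of tara̱ become six IDs plus one end\-of\-sequence ID, with no language\-specific vocabulary involved.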

\[Figure 2 diagram: the Tangkhul input tara̱ \(‘water’\) follows two paths\. SentencePiece \(mT5\) produces the subword split \[tar\]\[a\]<0xCC\><0xB1\>, which feeds a subword encoder \(standard self\-attention\) and is labelled as suboptimal fragmentation, with the diacritic separated into byte tokens\. ByT5 takes the raw UTF\-8 bytes 74 61 72 61 cc b1, which feed a byte encoder \(local attention \+ deep\) and is labelled as retaining full phonetic and orthographic information cleanly\.\]

Figure 2: Comparison of subword\-level \(mT5\) versus byte\-level \(ByT5\) representation for Tangkhul words with diacritics \(e\.g\., tara̱\)\. Subword tokenisers fragment rare combining diacritics into fallback byte tokens, separating the phonetic modifier from its base character\. ByT5 natively handles these characters uniformly as multi\-byte sequences\.

Model Name: tangkhul\-byt5

#### Preprocessing\.

Preprocessing follows Section [3](https://arxiv.org/html/2606.25365#S3)\. The task prefix "translate Tangkhul to English: " was prepended to every source sentence, following the standard T5 instruction format\.

#### Training Configuration\.

The model was fine\-tuned from the google/byt5\-large pretrained checkpoint\. Key hyperparameters are reported in Table [4](https://arxiv.org/html/2606.25365#S4.T4)\.

Table 4: ByT5\-large Training Hyperparameters

#### Inference\.

At inference time, beam search with num\_beams=4 and early\_stopping=True was used\. Translations were decoded from byte sequences back to UTF\-8 strings\.
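Put together, the task prefix and the beam\-search settings suggest an inference loop along the following lines. This is a hedged sketch, not the authors' pipeline: it assumes `model` and `tokenizer` are a Hugging Face seq2seq model and its matching tokenizer supplied by the caller, the function names are hypothetical, and `max_new_tokens=256` is an illustrative value that the text does not specify.

```python
PREFIX = "translate Tangkhul to English: "
GENERATION_KWARGS = {"num_beams": 4, "early_stopping": True}


def add_prefix(sentences):
    """Prepend the T5-style task prefix to every source sentence."""
    return [PREFIX + s for s in sentences]


def translate(model, tokenizer, sentences, max_new_tokens=256):
    """Translate a batch of Tangkhul sentences with beam search.

    max_new_tokens is an assumed limit, not a value reported in the paper.
    """
    batch = tokenizer(add_prefix(sentences), return_tensors="pt", padding=True)
    output_ids = model.generate(
        **batch, max_new_tokens=max_new_tokens, **GENERATION_KWARGS
    )
    # For ByT5 this step turns byte-level IDs back into UTF-8 strings.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

In practice the two objects would be loaded with `AutoModelForSeq2SeqLM.from_pretrained` and `AutoTokenizer.from_pretrained`, pointing at the released checkpoint, whose Hub identifier is not given in this text.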

#### Model Release\.

The fine\-tuned ByT5\-large model tangkhul\-byt5 and a live demo Gradio interface are publicly available\.

### 4\.2 Contrastive System: mT5\-small \(Fine\-tuned\)

#### Architecture\.

We fine\-tuned google/mt5\-small \(300M parameters\) as a contrastive system to investigate the trade\-off between byte\-level and subword\-level representations, and to evaluate how well our newly collected dataset supports fine\-tuning a model with a significantly smaller parameter count and a different architecture\. mT5\-small uses SentencePiece tokenisation with a 250,112\-token shared vocabulary \(including special tokens\)\.

Model Name: tangkhul\-mt5

#### Training Configuration\.

Full training hyperparameters are given in Table [5](https://arxiv.org/html/2606.25365#S4.T5)\.

Table 5: mT5\-small Training Hyperparameters

#### Training Dynamics\.

While the training run lasted for 25 epochs, the best validation BLEU checkpoint was achieved at epoch 24\. We report all hyperparameters and results based on this epoch 24 checkpoint\.

#### Inference\.

Beam search was used with num\_beams=5, max\_length=128, no\_repeat\_ngram\_size=3, and length\_penalty=1\.0\.

### 4\.3 Zero\-Shot Baseline

To contextualise our fine\-tuned systems, we also evaluated the unmodified google/mt5\-base \(580M parameters\) in a zero\-shot setting on 200 test sentences, using the same task prefix but without any Tangkhul\-specific fine\-tuning\.

## 5 Experimental Setup

### 5\.1 Evaluation Metrics

We evaluated our systems using four automatic metrics:

1. BLEU \(Papineni et al\., [2002](https://arxiv.org/html/2606.25365#bib.bib10)\): Corpus\-level BLEU computed via SacreBLEU \(Post, [2018](https://arxiv.org/html/2606.25365#bib.bib13)\) with the default tokenisation\.
2. chrF\+\+ \(Popović, [2015](https://arxiv.org/html/2606.25365#bib.bib11), [2017](https://arxiv.org/html/2606.25365#bib.bib12)\): Character n\-gram F\-score extended with word unigrams and bigrams \(word\_order=2\)\.
3. BERTScore F1 \(Zhang et al\., [2020](https://arxiv.org/html/2606.25365#bib.bib23)\): Computes token\-level cosine similarity using contextual BERT embeddings \(bert\-base\-uncased\)\.
4. COMET \(Rei et al\., [2020](https://arxiv.org/html/2606.25365#bib.bib17); Rei et al\., [2022](https://arxiv.org/html/2606.25365#bib.bib18)\): We used the Unbabel/wmt22\-comet\-da reference\-based model, which is trained on direct assessment \(DA\) human judgements\.
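For reference, the two surface metrics in the list above have the following standard forms. These are the textbook definitions rather than anything specific to this paper, and the symbols are introduced here for illustration: p_n is the clipped n\-gram precision, w_n = 1/4 the uniform weight, c and r the total hypothesis and reference lengths, and chrP and chrR the n\-gram precision and recall. The value β = 2 and a maximum character n\-gram order of 6 are SacreBLEU's defaults, which the text does not state explicitly.

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big( \sum_{n=1}^{4} w_n \log p_n \Big),
\qquad
\mathrm{BP} = \min\!\big( 1,\; e^{\,1 - r/c} \big)

\mathrm{chrF}_{\beta} = (1 + \beta^{2}) \,
\frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^{2}\, \mathrm{chrP} + \mathrm{chrR}}
```

In chrF\+\+, chrP and chrR are averaged over character n\-grams of orders 1 to 6 together with word n\-grams of orders 1 and 2, which is what word\_order=2 selects.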

### 5\.2 Evaluation Infrastructure

- SacreBLEU 2\.x via the sacrebleu Python package
- evaluate library \(Lhoest et al\., [2021](https://arxiv.org/html/2606.25365#bib.bib7)\) for BLEU and chrF\+\+
- bert\-score package for BERTScore
- unbabel\-comet \(v2\.x\) for COMET, computed with batch\_size=64 on a GPU

## 6 Results

### 6\.1 Main Results

Table [6](https://arxiv.org/html/2606.25365#S6.T6) presents our primary evaluation results on the held\-out test sets\.

Table 6: Main Evaluation Results \(Tangkhul→English\), evaluated on the full 3,856\-sentence test set\.

The results demonstrate several key findings:

ByT5\-large substantially outperforms mT5\-small by \+27\.76 BLEU points \(39\.97 vs\. 12\.21\) and \+27\.88 chrF\+\+ points\. This large gap reflects both the parameter count advantage \(1\.2B vs\. 300M\) and the suitability of byte\-level processing for Tangkhul’s diacritised Latin orthography\.

Zero\-shot transfer is essentially non\-functional for Tangkhul \(BLEU 0\.03 for the unmodified mT5\-base on 200 test sentences\), indicating that a multilingual pretrained model of this kind acquires no meaningful Tangkhul representations from pretraining alone\.

In addition to the primary surface\-level metrics reported in Table [6](https://arxiv.org/html/2606.25365#S6.T6), we also computed deep semantic metrics exclusively for our primary ByT5\-large system, which achieved a COMET score of 0\.7302 and a BERTScore F1 of 0\.8104\. Because our task is Tangkhul→English translation, these metrics primarily evaluate the semantic equivalence between the generated English hypothesis and the English reference\. While cross\-language comparisons of absolute COMET scores \(e\.g\., comparing to high\-resource systems\) should be avoided due to the metric models’ lack of prior exposure to the Tangkhul source text, these scores establish a robust initial baseline for future Tangkhul NLP research\.

### 6\.2 Inference Hyperparameter Ablation

To determine the optimal inference parameters for our primary ByT5\-large model, we conducted an ablation study varying the beam search width\. Figure [3](https://arxiv.org/html/2606.25365#S6.F3) illustrates the trade\-off between translation quality \(BLEU\) and relative inference time as the beam size increases\. The graph shows diminishing returns in BLEU score for beam sizes larger than 4, while inference time scales almost linearly\. Consequently, we selected a beam size of 4 \(num\_beams=4\) for our standard evaluation pipeline to balance accuracy and decoding speed\.
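An ablation of this kind amounts to a small timing loop. The harness below is a generic sketch rather than the authors' code: `translate` and `score` are caller\-supplied callables (for example a beam\-search decoder and a corpus\-BLEU function), the name `beam_ablation` is hypothetical, and the default beam sizes are an assumption read off the axis ticks of Figure 3.

```python
import time


def beam_ablation(translate, score, sources, references,
                  beam_sizes=(1, 2, 4, 6, 8, 10)):
    """Measure quality and wall-clock time for each beam size.

    translate(sources, num_beams=k) must return a list of hypotheses;
    score(hypotheses, references) must return a single number.
    """
    results = []
    for k in beam_sizes:
        start = time.perf_counter()
        hypotheses = translate(sources, num_beams=k)
        elapsed = time.perf_counter() - start
        results.append({"num_beams": k,
                        "score": score(hypotheses, references),
                        "seconds": elapsed})
    # Express time relative to the first (smallest) beam size.
    baseline = results[0]["seconds"] or 1e-9
    for row in results:
        row["relative_time"] = row["seconds"] / baseline
    return results
```

Plotting `score` and `relative_time` against `num_beams` gives the two curves that the paragraph above describes.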

\[Figure 3 plot: dual\-axis line chart with x\-axis Beam Size \(ticks 1 to 10\), y\-axes BLEU Score \(ticks 38 to 41\) and Relative Inference Time \(ticks 0 to 6\), and series BLEU Score and Inference Time\]

Figure 3: Effect of beam size on BLEU score and relative inference time\. A beam size of 4–5 offers the best trade\-off between translation quality and computational efficiency\.

### 6\.3 Sentence\-Level Analysis

Due to resource constraints during evaluation, we computed sentence\-level BLEU scores specifically for the mT5\-small system to understand baseline characteristics\. While we expect our primary ByT5\-large system to follow a similar qualitative distribution shifted higher, computing its full sentence\-level statistics remains future work\. For the mT5\-small system, the mean sentence BLEU was 11\.69 with a median of 7\.35, indicating a right\-skewed distribution \(note that the arithmetic mean of sentence BLEU scores differs methodologically from the corpus BLEU of 12\.21 reported in Table [6](https://arxiv.org/html/2606.25365#S6.T6), which aggregates n\-gram matches globally\)\. High\-scoring sentences tend to be shorter, contain common biblical formulae, or involve proper nouns that are transliterated identically \(e\.g\., Jesus, Israel, Elijah\)\. Low\-scoring sentences typically involve complex verbal morphology or culturally specific terms\.
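The difference between mean sentence BLEU and corpus BLEU can be seen in a small self\-contained implementation. The sketch below is a simplified, unsmoothed BLEU over whitespace tokens, written only to show how the two aggregations diverge; it is not SacreBLEU, whose tokenisation and sentence\-level smoothing differ, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_stats(hyp, ref, max_n=4):
    """Clipped n-gram matches and hypothesis n-gram totals for one pair."""
    matches, totals = [], []
    for n in range(1, max_n + 1):
        h, r = ngram_counts(hyp, n), ngram_counts(ref, n)
        matches.append(sum(min(c, r[g]) for g, c in h.items()))
        totals.append(sum(h.values()))
    return matches, totals, len(hyp), len(ref)


def bleu_from_stats(matches, totals, hyp_len, ref_len):
    """Unsmoothed BLEU (0-100): any zero n-gram precision yields 0."""
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / len(matches)
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return 100 * brevity * math.exp(log_prec)


def corpus_and_mean_sentence_bleu(hyps, refs):
    """Return (corpus BLEU, arithmetic mean of sentence BLEU)."""
    per_sentence = []
    agg_m, agg_t, agg_h, agg_r = [0] * 4, [0] * 4, 0, 0
    for hyp, ref in zip(hyps, refs):
        m, t, hl, rl = bleu_stats(hyp.split(), ref.split())
        per_sentence.append(bleu_from_stats(m, t, hl, rl))
        agg_m = [a + b for a, b in zip(agg_m, m)]
        agg_t = [a + b for a, b in zip(agg_t, t)]
        agg_h, agg_r = agg_h + hl, agg_r + rl
    corpus = bleu_from_stats(agg_m, agg_t, agg_h, agg_r)
    return corpus, sum(per_sentence) / len(per_sentence)
```

Corpus BLEU pools the n\-gram counts before taking precisions, so a short sentence that scores zero on its own can still contribute matches, whereas the sentence\-level mean weights every sentence equally regardless of length.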

Table 7: Examples of mT5\-small translations, demonstrating hallucinated repetition loops and truncation issues\.

Table 8: Examples of ByT5\-large translations, showing higher fluency and accuracy\.

### 6\.4 Preliminary Qualitative Exploration of Ensemble Re\-Ranking

In an attempt to improve the translation quality, we experimented with an ensemble re\-ranking approach combining both mT5 and ByT5 scores\. We generated candidate translations and scored them using both models, selecting the candidate with the highest average score\.

However, in several qualitative examples, we found that this ensemble approach amplified hallucinations and repetition loops rather than mitigating them\. For instance, given the source Haokaphokli Varena kazing eina ngalei sai\., the mT5 prediction was “God built the land with the heavens\.” When employing the ensemble re\-ranking, the selected translation was “God made the land of the heavens, and the land of the earth\.” While accurate candidates \(e\.g\., “In the beginning, God created heaven and earth\.”\) received high scores from ByT5, they were heavily penalized by mT5\. Ultimately, the ensemble selected a repetitive and hallucinatory candidate because it achieved the highest average score across both models, negating the strengths of ByT5\.
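The failure mode follows directly from averaging. In the sketch below the candidate strings are taken from the example above, but the numeric scores are hypothetical values invented purely for illustration (the text does not report the actual scores or say what kind of score was used); here they are treated as length\-normalised log\-probabilities, where higher is better.

```python
def rerank_by_average(candidates):
    """Pick the candidate with the highest mean of the two model scores.

    candidates: list of (text, byt5_score, mt5_score) tuples.
    """
    return max(candidates, key=lambda c: (c[1] + c[2]) / 2)


# Hypothetical scores, chosen only to reproduce the qualitative effect.
candidates = [
    ("In the beginning, God created heaven and earth.", -0.20, -2.40),
    ("God made the land of the heavens, and the land of the earth.", -0.90, -0.70),
]

best = rerank_by_average(candidates)
print(best[0])
```

With these numbers the accurate candidate averages \-1\.30 and the repetitive one \-0\.80, so the weaker model's strong penalty outweighs the stronger model's preference. A weighted average or a veto threshold would behave differently, but neither is described in the text.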

### 6\.5 Domain Effects

While our corpus includes conversational data and stories, it is drawn predominantly from the Tangkhul Bible translation, which introduces several systematic biases to the majority of the dataset:

1. Lexical coverage: Biblical vocabulary is dominated by religious and archaic register terms\.
2. Syntactic bias: Biblical English follows a rigid, archaic syntactic style with frequent use of passive voice and formal sentence structure\.
3. Named entity density: A disproportionate fraction of source tokens are proper nouns\.
4. Repetitive structures: Biblical text contains many formulaic repeated phrases\.

### 6\.6 Structured Error Analysis

To better understand the limitations of the ByT5\-large model, we conducted a manual error analysis on a random sample of 100 translated sentences from the test set\. We categorised the primary failure modes into a preliminary qualitative error taxonomy, detailed in Table [9](https://arxiv.org/html/2606.25365#S6.T9)\. Lexical substitution and stylistic hallucination emerged as the two most salient qualitative failure modes in this sample\.

Table 9: Preliminary Qualitative Error Taxonomy with illustrative examples of common ByT5\-large failure modes\.

As seen in Table [9](https://arxiv.org/html/2606.25365#S6.T9), the model exhibits several interesting failure modes when translating conversational Tangkhul\. Lexical Substitution occurs when the model swaps core nouns or adjectives, for example translating ‘water’ \(taru\) to ‘milk’ and ‘more’ \(chungda\) to ‘a little’\. Furthermore, Stylistic Hallucination impacts the generalisation of the model to conversational text\. When presented with casual phrases like “Stop hanging out at night”, the model dramatically extrapolates the tone, translating it as a question: “No night exploration again?”\.

## 7 Limitations

Domain mismatch: Although our model includes conversational and story data, it is trained predominantly on biblical text and may generalise poorly to certain modern domains\.

Evaluation metric limitations: Automatic metrics, including COMET, are imperfect proxies for human translation quality\.

Single direction: We trained and evaluated primarily in the Tangkhul→English direction\. English→Tangkhul MT is equally important but presents additional challenges including hallucination of diacritics\.

Data scale: 38,336 sentence pairs is a large corpus by the standards of zero\-resource NLP but is still three orders of magnitude smaller than the training data available for high\-resource pairs\.

## 8 Conclusion

We have presented our work on low\-resource Tangkhul–English machine translation, to our knowledge the first dedicated MT system publicly released for this language\. Our primary system, a ByT5\-large model fine\-tuned on 38,336 parallel sentence pairs \(comprising biblical, conversational, and story data\), achieves a BLEU score of 39\.97, chrF\+\+ of 58\.07, BERTScore F1 of 0\.8104, and COMET of 0\.7302\. Our contrastive mT5\-small system achieves BLEU 12\.21, and a zero\-shot mT5\-base achieves effectively zero BLEU \(0\.03\)\. The byte\-level processing of ByT5\-large proves highly advantageous for Tangkhul’s diacritised Latin script, handling the language’s special characters natively\. We release our best model \(tangkhul\-byt5\) and the fine\-tuned mT5 \(tangkhul\-mt5\) to facilitate future research\. Critical next steps include expanding the corpus further into non\-biblical domains, using back\-translation to augment training data, and extending to English→Tangkhul translation\.

## Acknowledgements

We thank the Tangkhul community for their invaluable linguistic resources\.

## References

- \[1\] D\. I\. Adelani, J\. Alabi, A\. Fan, J\. Kreutzer, X\. Shen, M\. Reid, D\. Ruiter, D\. Klakow, P\. Nabende, E\. Chang, et al\. \(2022\) A few thousand translations go a long way\! Leveraging pre\-trained models for African news translation\. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp\. 3053–3070\.
- \[2\] Ž\. Agić and I\. Vulić \(2019\) JW300: a wide\-coverage parallel corpus for low\-resource languages\. In Proceedings of ACL 2019, pp\. 3204–3210\.
- \[3\] V\. Ahum \(1997\) Tangkhul\-Naga grammar: a study of word formation\. Ph\.D\. Thesis, Jawaharlal Nehru University, New Delhi\.
- \[4\] M\. Artetxe, G\. Labaka, E\. Agirre, and K\. Cho \(2018\) Unsupervised neural machine translation\. In Proceedings of ICLR 2018\.
- \[5\] A\. L\. Chanu \(2019\) A descriptive grammar of Lairamlo: a dialect of Tangkhul\. Ph\.D\. Thesis, Assam University\. Note: hdl:10603/355393\. External Links: [Link](https://shodhganga.inflibnet.ac.in/handle/10603/355393)\.
- \[6\] L\. S\. Devi \(2019\) A descriptive grammar of Kabonglo: a dialect of Tangkhul\. Ph\.D\. Thesis, Assam University\. Note: hdl:10603/355391\. External Links: [Link](https://shodhganga.inflibnet.ac.in/handle/10603/355391)\.
- \[7\] L\. Edman, G\. Sarti, A\. Toral, G\. van Noord, and A\. Bisazza \(2024\) Are character\-level translations worth the wait? Comparing ByT5 and mT5 for machine translation\. Transactions of the Association for Computational Linguistics 12, pp\. 392–410\.
- \[8\] Ethnologue: Languages of the World \(2016\) Myanmar\. Note: [https://web\.archive\.org/web/20161010180533/http://www\.ethnologue\.com/country/MM/languages](https://web.archive.org/web/20161010180533/http://www.ethnologue.com/country/MM/languages)\. Archived from the original on 10 October 2016\.
- \[9\] Ethnologue \(2015\) Tangkhul\. 18th edition\. Note: [https://www\.ethnologue\.com/18/language/nmf/](https://www.ethnologue.com/18/language/nmf/)\. Subscription required\.
- \[10\] J\. Gala, P\. A\. Chitale, R\. Ak, V\. Gumma, S\. Doddapaneni, A\. Kumar, J\. Nawale, A\. Sujatha, R\. Puduppully, V\. Raghavan, et al\. \(2023\) IndicTrans2: towards high\-quality and accessible machine translation models for all 22 scheduled Indian languages\. arXiv preprint arXiv:2305\.16307\.
- \[11\] J\. Gu, H\. Hassan, J\. Devlin, and V\. O\. Li \(2018\) Universal neural machine translation for extremely low resource languages\. In Proceedings of NAACL\-HLT 2018, pp\. 344–354\.
- \[12\] G\. Lample, A\. Conneau, L\. Denoyer, and M\. Ranzato \(2018\) Unsupervised machine translation using monolingual corpora only\. In Proceedings of ICLR 2018\.
- \[13\] Q\. Lhoest et al\. \(2021\) Datasets: a community library for natural language processing\. In Proceedings of EMNLP 2021: System Demonstrations, pp\. 175–184\.
- \[14\] K\. S\. Lisam \(2011\) Encyclopaedia of Manipur\. Vol\. 3, Gyan Publishing House\. External Links: ISBN 978\-81\-7835\-864\-2\.
- \[15\] T\. Mayer and M\. Cysouw \(2014\) Creating a massively parallel Bible corpus\. In Proceedings of LREC 2014, pp\. 3158–3163\.
- \[16\] K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\) BLEU: a method for automatic evaluation of machine translation\. In Proceedings of ACL 2002, pp\. 311–318\.
- \[17\] W\. Pettigrew \(1897\) Tangkhul primer and catechism\.
- \[18\] M\. Popović \(2015\) ChrF: character n\-gram F\-score for automatic MT evaluation\. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pp\. 392–395\.
- \[19\] M\. Popović \(2017\) ChrF\+\+: words helping character n\-grams\. In Proceedings of the Second Conference on Machine Translation \(WMT17\), pp\. 612–618\.
- \[20\] M\. Post \(2018\) A call for clarity in reporting BLEU scores\. In Proceedings of the Third Conference on Machine Translation \(WMT18\): Research Papers, pp\. 186–191\.
- \[21\] C\. Raffel et al\. \(2020\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Journal of Machine Learning Research 21\(140\), pp\. 1–67\.
- \[22\] G\. Ramesh, S\. Doddapaneni, A\. Bheemaraj, M\. Jobanputra, R\. AK, A\. Sharma, S\. Sahoo, H\. Diddee, D\. Kakwani, N\. Kumar, A\. Majumder, D\. Raman, V\. Jain, S\. Tiwary, M\. Yadav, A\. Kunchukuttan, P\. Ramesh, J\. Gala, S\. Doshi, P\. M M, V\. Kharde, S\. V, S\. Prakhya, A\. Madasu, R\. Agrawal, P\. S, S\. H, A\. K, M\. M\. Khapra, and P\. Kumar \(2022\) IndicTrans: towards neural machine translation for 22 Indic languages\. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp\. 1–13\.
- \[23\]R\. Reiet al\.\(2022\)COMET\-22: unbabel\-ist 2022 submission for the metrics shared task\.InProceedings of WMT22,pp\. 578–585\.Cited by:[item 4](https://arxiv.org/html/2606.25365#S5.I1.i4.p1.1)\.
- \[24\]R\. Rei, C\. Stewart, A\. C\. Farinha, and A\. Lavie\(2020\)COMET: a neural framework for mt evaluation\.InProceedings of EMNLP 2020,pp\. 2685–2695\.Cited by:[item 4](https://arxiv.org/html/2606.25365#S5.I1.i4.p1.1)\.
- \[25\]V\. S\. K\. S\(2023\-11\)Manipur: literature festival strives to promote Tangkhul language\.EastMojo\.Note:[http://www\.eastmojo\.com/manipur/2023/11/26/manipur\-literature\-festival\-strives\-to\-promote\-tangkhul\-language/](http://www.eastmojo.com/manipur/2023/11/26/manipur-literature-festival-strives-to-promote-tangkhul-language/)Retrieved 27 November 2023Cited by:[§1](https://arxiv.org/html/2606.25365#S1.p2.1)\.
- \[26\]V\. Sanyu\(1996\)A history of nagas and nagaland: dynamics of oral tradition in village formation\.Commonwealth Publishers\.External Links:ISBN 978\-81\-7169\-369\-6Cited by:[§1](https://arxiv.org/html/2606.25365#S1.p2.1)\.
- \[27\]R\. Sennrich, B\. Haddow, and A\. Birch\(2016\)Improving neural machine translation models with monolingual data\.InProceedings of ACL 2016,pp\. 86–96\.Cited by:[§2\.2](https://arxiv.org/html/2606.25365#S2.SS2.p1.1)\.
- \[28\]A\. S\. W\. Shimray\(2001\)History of the tangkhul nagas\.Akansha Publishing House\.External Links:ISBN 978\-81\-87606\-04\-8Cited by:[§1](https://arxiv.org/html/2606.25365#S1.p2.1)\.
- \[29\]A\. Vaswaniet al\.\(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,pp\. 5998–6008\.Cited by:[§2\.2](https://arxiv.org/html/2606.25365#S2.SS2.p1.1)\.
- \[30\]L\. Xueet al\.\(2021\)MT5: a massively multilingual pre\-trained text\-to\-text transformer\.InProceedings of NAACL 2021,pp\. 483–498\.Cited by:[§2\.3](https://arxiv.org/html/2606.25365#S2.SS3.p2.1)\.
- \[31\]L\. Xueet al\.\(2022\)ByT5: towards a token\-free future with pre\-trained byte\-to\-byte models\.Transactions of the Association for Computational Linguistics10,pp\. 291–306\.Cited by:[§2\.3](https://arxiv.org/html/2606.25365#S2.SS3.p1.1)\.
- \[32\]T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi\(2020\)BERTScore: evaluating text generation with BERT\.InProceedings of ICLR 2020,Cited by:[item 3](https://arxiv.org/html/2606.25365#S5.I1.i3.p1.1)\.
- \[33\]B\. Zoph, D\. Yuret, J\. May, and K\. Knight\(2016\)Transfer learning for low\-resource neural machine translation\.InProceedings of EMNLP 2016,pp\. 1568–1575\.Cited by:[§2\.2](https://arxiv.org/html/2606.25365#S2.SS2.p1.1)\.
