Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

arXiv cs.CL 05/22/26, 04:00 AM Models
multilingual-translation machine-translation tencent open-source nlp fast-thinking moe
Summary
Hy-MT2 is a family of fast, efficient multilingual translation models from Tencent, available in 1.8B, 7B, and 30B-A3B sizes, supporting 33 languages and outperforming previous open-source and commercial models.
arXiv:2605.22064v1 Announce Type: new Abstract: Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. For on-device deployment, with AngelSlim 1.25-bit extreme quantization, the 1.8B model requires only 440 MB of storage and improves inference speed by 1.5x. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall.
Cached at: 05/22/26, 08:45 AM
# A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild
Source: [https://arxiv.org/html/2605.22064](https://arxiv.org/html/2605.22064)
## Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

###### Abstract

Hy-MT2 is a family of fast-thinking multilingual translation models designed for complex real-world scenarios. It includes three model sizes: 1.8B, 7B, and 30B-A3B (MoE), all of which support translation among 33 languages and effectively follow translation instructions in multiple languages. For on-device deployment, with AngelSlim 1.25-bit extreme quantization, the 1.8B model requires only 440 MB of storage and improves inference speed by 1.5x. Multi-dimensional evaluations show that Hy-MT2 delivers outstanding performance across general, real-world business, domain-specific, and instruction-following translation tasks. The 7B and 30B models outperform open-source models such as DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode, while the lightweight 1.8B model also surpasses mainstream commercial APIs from providers such as Microsoft and Doubao overall.

Hugging Face: [https://huggingface.co/collections/tencent/hy-mt2](https://huggingface.co/collections/tencent/hy-mt2)

GitHub: [https://github.com/Tencent-Hunyuan/Hy-MT2](https://github.com/Tencent-Hunyuan/Hy-MT2)

![Refer to caption](https://arxiv.org/html/2605.22064v1/x1.png)

Figure 1: Benchmark performance of Hy-MT2 models and state-of-the-art baselines.

## 1 From Hy-MT1.5 to Hy-MT2

After its release, Hy-MT1.5 (Zheng et al., [2025](https://arxiv.org/html/2605.22064#bib.bib15)) attracted broad attention from both the open-source community and real-world business applications. As the model was adopted in more practical translation scenarios, community and business feedback revealed that Hy-MT1.5 still had room for improvement in domain-specific translation, real-world scenario translation, translation instruction following, and efficient on-device deployment. Meanwhile, the substantial quality improvements achieved by Hy3-preview further motivated us to leverage it as a strong teacher model to improve the performance of the Hy translation model. To address these limitations, we propose the Hy-MT2 model family.

First, domain\-specific and real\-world scenario translation remain challenging for Hy\-MT1\.5\. Professional domains such as finance, law, and medicine contain a large number of domain\-specific terms and established industry translations, placing higher requirements on translation accuracy and consistency\. Real\-world business scenarios, such as webpages, meetings, and social content, involve more diverse text formats and usage requirements\. To address these challenges, Hy\-MT2 strengthens translation capabilities for professional domains and real\-world application scenarios, enabling the model to better adapt to translation needs across different domains, sources, and text forms\.

Second, in practical use, users often impose additional constraints on translation outputs, such as keeping certain words untranslated, controlling the translation style, or producing outputs according to a specified template. In such scenarios, Hy-MT1.5 may ignore constraints or fail to satisfy the specified requirements. To this end, Hy-MT2 enhances multilingual translation instruction understanding and execution, enabling the model to reliably follow user requirements in different languages, including those related to style, format, and other translation constraints, as illustrated in Table [1](https://arxiv.org/html/2605.22064#S1.T1).

In addition, community feedback indicates that Hy-MT1.5-7B still has a clear gap in translation quality compared with the strongest closed-source models, such as Gemini 3.1 Pro (DeepMind, [2025](https://arxiv.org/html/2605.22064#bib.bib1)) and GPT-5.5 (OpenAI, [2026](https://arxiv.org/html/2605.22064#bib.bib10)). Prior research and model practices suggest that scaling up model size generally helps improve understanding, expression, and instruction-following capabilities in complex translation scenarios. However, representative large-scale translation models, such as TranslateGemma-27B, mostly adopt dense architectures, leading to high inference costs and making them less suitable for practical service deployment. Therefore, Hy-MT2 introduces a mixture-of-experts architecture and releases Hy-MT2-30B-A3B to achieve a better balance between translation quality and inference efficiency.

Finally, real\-world business deployment also exposed the limitations of Hy\-MT1\.5 in on\-device efficiency\. The 4\-bit quantized version of Hy\-MT1\.5\-1\.8B still requires more than 1GB of storage, and its inference speed is insufficient for some low\-latency translation scenarios\. To address this issue, Hy\-MT2 further explores ultra\-low\-bit quantization and implements 1\.25\-bit extreme quantization based on Hunyuan’s AngelSlim technology\. This version requires only about 440MB of storage for deployment and achieves a 1\.5× inference speedup over the 4\-bit quantized Hy\-MT1\.5 on Apple A15, significantly reducing on\-device deployment costs while improving inference efficiency\.

Overall, Hy\-MT2 systematically addresses the limitations of Hy\-MT1\.5 in domain\-specific translation, real\-world scenario translation, translation instruction following, the performance gap with the strongest closed\-source models, and efficient on\-device deployment\. It establishes a high\-quality, efficient, and multi\-capability multilingual translation model family that is better suited for real\-world applications\.

Table 1: Instruction examples for Hy-MT2 translation tasks in Chinese and English.

| Type | Chinese prompt | English prompt |
| --- | --- | --- |
| Default Translation | 将以下文本翻译为{target_lang}，注意只需要输出翻译后的结果，不要额外解释：{source_text} | Translate the following text into {target_lang}. Note that you should only output the translated result without any additional explanation: {source_text} |
| Terminology | 参考下面的翻译：<br>{text}翻译成{text}<br>{text}翻译成{text}<br>{text}翻译成{text}<br>将以下文本翻译为{target_lang}，注意只需要输出翻译后的结果，不要额外解释：{source_text} | Reference the following translations:<br>{text} translates to {text}<br>{text} translates to {text}<br>{text} translates to {text}<br>Translate the following text into {target_lang}. Note that you must ONLY output the translated result without any additional explanation: {source_text} |
| Style | 请将以下文本翻译为{{target_lang}}。<br>注意翻译的风格要严格符合【{{target_style}}】{{source_text}} | Please translate the following text into {{target_lang}}. Note that the translation style must strictly conform to \[{{target_style}}\]: {{source_text}} |
| Personalization | 【待翻译文本】<br>{source_text}<br>【翻译任务】<br>1、{user_preferences}<br>2、{user_preferences}<br>3、……<br>4、将【待翻译文本】翻译为{target_lang}。 | \[Source Text\]<br>{source_text}<br>\[Translation Tasks\]<br>1. {user_preferences}<br>2. {user_preferences}<br>3. …<br>4. Translate the \[Source Text\] into {target_lang}. |
| Delimiters | 请将以下文本准确翻译为{{target_lang}}。<br>你必须在译文中保留等量的分隔符，绝对不可遗漏、转义或翻译该符号，并注意分隔符的位置。{{source_text}} | Please accurately translate the following text into {{target_lang}}.<br>You must retain the exact same number of delimiters in the translation. Strictly do not omit, escape, or translate these symbols, and pay close attention to their placement. {{source_text}} |
| Structured Data 1 | \# 任务目标<br>将下方{{source_text}}中的{{format_type}}格式数据翻译为{{target_lang}}。<br>\# 严格约束<br>1. \*\*结构锁定\*\*: 绝对保持原有的{{format_type}}数据结构、缩进和层级完全不变。<br>2. \*\*选择性翻译\*\*: 仅翻译面向用户展示的可见文本内容。<br>3. \*\*禁止修改\*\*: \*\*严禁\*\*翻译或更改任何代码标签、键名 (Key)、变量占位符 (如 ‘{{var}}‘, ‘\${var}‘, ‘%s‘, ‘%d‘ 等) 或代码属性。<br>\# 数据输入 {{source_text}} | \#\#\# Task<br>Translate the user-facing text within the following {{format_type}} data into {{target_lang}}.<br>\#\#\# Strict Rules<br>1. \*\*Structure Preservation:\*\* You MUST preserve the original {{format_type}} data structure, nesting, hierarchy, and indentation exactly as they are.<br>2. \*\*Selective Translation:\*\* Translate ONLY the visible, user-facing text content/values.<br>3. \*\*Strict Non-Translation:\*\* NEVER translate or alter code tags, keys, properties, object names, or variable placeholders. Leave them exactly in their original English/code form.<br>\#\#\# Source Data {{source_text}} |
| Structured Data 2 | 【背景信息】<br>{{background_text}}<br>请结合背景信息将以下文本翻译为{{target_lang}}。<br>【待翻译文本】 {{source_text}} | \[Background Information\]<br>{{background_text}}<br>Please translate the following text into {{target_lang}}, taking the provided background information into consideration.<br>\[Source Text\] {{source_text}} |

Notes. This table shows representative instruction templates in Chinese and English. Additional multilingual instruction examples are provided in Section [3.6](https://arxiv.org/html/2605.22064#S3.SS6).
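To make the templates in Table 1 concrete, the following minimal sketch fills the English default and terminology prompts with Python string formatting. The helper names `build_default_prompt` and `build_terminology_prompt` are hypothetical, and the placement of line breaks in the terminology prompt is an assumption, since the table does not preserve the exact whitespace.

```python
# English templates taken from Table 1; helper names are illustrative only.
DEFAULT_TEMPLATE = (
    "Translate the following text into {target_lang}. Note that you should only "
    "output the translated result without any additional explanation: {source_text}"
)


def build_default_prompt(target_lang, source_text):
    """Fill the default translation template."""
    return DEFAULT_TEMPLATE.format(target_lang=target_lang, source_text=source_text)


def build_terminology_prompt(glossary, target_lang, source_text):
    """Prepend one 'X translates to Y' line per glossary entry (assumed layout)."""
    lines = ["Reference the following translations:"]
    lines += [f"{src} translates to {tgt}" for src, tgt in glossary.items()]
    lines.append(
        f"Translate the following text into {target_lang}. Note that you must ONLY "
        f"output the translated result without any additional explanation: {source_text}"
    )
    return "\n".join(lines)


default_prompt = build_default_prompt("French", "Good morning.")
term_prompt = build_terminology_prompt({"edge device": "appareil périphérique"}, "French", "Edge devices are small.")
```

The resulting strings would be sent as the user message to whichever inference stack hosts the model; that part is outside the scope of this sketch.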

## 2 Methodology

This section introduces the overall methodology of Hy-MT2. Designed for multilingual machine translation, Hy-MT2 follows a staged pipeline consisting of MT-oriented Mid-training (Section [2.1](https://arxiv.org/html/2605.22064#S2.SS1)), Family-Centric Post-training (FCPT; Section [2.2](https://arxiv.org/html/2605.22064#S2.SS2)), and model Quantization (Section [2.3](https://arxiv.org/html/2605.22064#S2.SS3)). Specifically, we start from a general Hy-series Pretraining Model and perform MT-oriented Mid-training to obtain a unified model with fundamental translation capabilities. The model is then further optimized through FCPT. As shown in Figure [2](https://arxiv.org/html/2605.22064#S2.F2), FCPT consists of three key processes: Reference-Guided On Policy Distillation (RG-OPD; Section [2.2.1](https://arxiv.org/html/2605.22064#S2.SS2.SSS1)), Family-specific RL Training (Section [2.2.2](https://arxiv.org/html/2605.22064#S2.SS2.SSS2)), and Cross-family On Policy Distillation (Cross-family OPD; Section [2.2.3](https://arxiv.org/html/2605.22064#S2.SS2.SSS3)). The first two processes organize training around language families and construct multiple family-specific strong teachers; Cross-family OPD then transfers their capabilities into a unified student model while incorporating general instruction-following data to preserve the model's instruction-following ability beyond translation.

### 2.1 MT-oriented Mid-training

![Refer to caption](https://arxiv.org/html/2605.22064v1/x2.png)

Figure 2: Family-Centric Post-training pipeline of Hy-MT2.

In the MT-oriented Mid-training stage, we start from the Hy-series Pretraining Model and continue training it on approximately 1T tokens of large-scale multilingual translation-related data. This stage aims to strengthen the model's translation capability and provide a unified foundation for the subsequent Family-Centric Post-training.

Specifically, the training data is organized along two dimensions:

- Data format: We use both multilingual monolingual corpora and parallel translation corpora to help the model capture linguistic characteristics across different languages and strengthen cross-lingual semantic mapping and source-target alignment.
- Scenario coverage: The data covers general translation, domain-specific translation, real-world scenarios, and instruction-following examples, improving translation quality, domain adaptation, practical translation robustness, and the ability to follow translation-related instructions.

The output of this stage is an MT-oriented Mid-trained Model, which serves as the unified starting point for FCPT.

### 2.2 Family-Centric Post-training

Instead of directly mixing data from all language families, FCPT divides training into multiple family branches, covering diverse language groups, e\.g\., Western European, East Asian, and Middle Eastern right\-to\-left languages\. Within each branch, we incorporate general translation data, domain\-specific translation data, real\-world business scenario data, and translation instruction\-following data to construct a family\-specific teacher\. This family\-centric design allows each teacher to learn under a more consistent language distribution, reducing interference across different language families\.

#### 2.2.1 Reference-Guided On Policy Distillation

Reference-Guided On Policy Distillation is the first stage of FCPT. In this stage, we perform on-policy distillation separately on each family branch, aiming to obtain a family-specific translation policy that better captures the linguistic characteristics and translation preferences of the corresponding language family. The resulting model further serves as a stronger initialization for subsequent Family-specific RL Training.

The core of RG-OPD is the construction of a stronger Chimera Teacher. Unlike conventional distillation methods that rely on a single teacher model, Chimera Teacher¹ does not require training an additional large-scale translation-specialized teacher model. For each source sentence, it integrates candidate translations generated by multiple Hy-series reference models with the original dataset label. Although not all labels are manually annotated, they still serve as useful reference signals. By fusing these multi-source references, Chimera Teacher provides richer scoring signals, helps introduce greater diversity into the distillation process, and constructs a stronger supervision signal for on-policy distillation.

¹ The name Chimera is inspired by the mythological creature composed of multiple animals. In our setting, it refers to a teacher signal constructed by fusing multiple reference sources. We implement Chimera Teacher based on Hy3-Preview.

Specifically, given a source sentence $x$ and its reference set $\mathcal{R}(x)$, where $\mathcal{R}(x)$ consists of multiple candidate reference sources, the student model in the current family branch first generates a translation $y$ based on its current policy $\pi_{\theta}$. Chimera Teacher then evaluates the student output based on the multi-source reference set $\mathcal{R}(x)$ and produces a teacher policy or target distribution $\pi_{T}(\cdot \mid x, \mathcal{R}(x))$. The student model is optimized by minimizing the forward KL divergence from the teacher policy to the student policy. The training objective can be written as:

$$\mathcal{L}_{\text{RG-OPD}} = D_{\text{KL}}\left(\pi_{T}(\cdot \mid x, \mathcal{R}(x)) \parallel \pi_{\theta}(\cdot \mid x)\right). \tag{1}$$

Here, $\pi_{T}$ denotes the distillation target distribution constructed by Chimera Teacher, and $\pi_{\theta}$ denotes the output policy of the current student model. We adopt forward KL divergence as the distillation objective, enabling the student model to learn the fused translation preference from Chimera Teacher in an online manner and gradually improve its translation policy for the corresponding language family.
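As a toy illustration of Equation (1), the sketch below computes the forward KL divergence between a teacher and a student distribution at each position of a student-sampled translation. The three-word vocabulary, the probability values, and the choice to average over positions are assumptions made for illustration; the paper does not specify how the per-token terms are aggregated.

```python
import math


def forward_kl(teacher, student, eps=1e-12):
    """D_KL(teacher || student) over a shared vocabulary."""
    return sum(t * math.log(t / max(s, eps)) for t, s in zip(teacher, student) if t > 0)


def rg_opd_loss(teacher_dists, student_dists):
    """Average the per-position forward KL along one student-sampled translation."""
    per_token = [forward_kl(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Two decoding positions over a toy 3-word vocabulary (made-up numbers).
teacher_dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student_dists = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]
loss = rg_opd_loss(teacher_dists, student_dists)
```

Only the first position contributes to the loss here, because the student already matches the teacher at the second position.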

After RG-OPD, each family branch obtains a family-specific student model that has been adapted to the translation preferences and expression patterns of the corresponding language family. These models are then used as the initialization for subsequent Family-specific RL Training.

#### 2.2.2 Family-specific RL Training

In Family-specific RL Training, each family branch is further optimized through Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.22064#bib.bib12)), using the model obtained from RG-OPD as initialization. To provide more fine-grained and rigorous reward signals, we introduce a hybrid evaluation system combining a rule-based pre-filter with an LLM-based Multidimensional Quality Metrics (MQM) judge (Lommel et al., [2014](https://arxiv.org/html/2605.22064#bib.bib9); Freitag et al., [2021](https://arxiv.org/html/2605.22064#bib.bib3)).

Rule-based Pre-filtering. Before passing translations to the LLM evaluator, a rule-based filter is applied to intercept critical text degradation. Translations exhibiting severe repetition or mixed languages are immediately assigned a reward of 0. This ensures early penalization of degenerated outputs and avoids unnecessary LLM computation.

LLM Judge Evaluation System. For translations that pass the pre-filter, the LLM-based judge evaluates the text based on a 5-dimensional error typology rather than assigning a holistic score. The dimensions are:

- Terminology: Identifies terminology errors, inconsistencies with terminological resources, or inconsistent usage throughout the text.
- Accuracy: Detects mistranslations, over-translation, under-translation, added or omitted content, unwarranted translations, missed translations, and instances of mixed languages.
- Linguistic Conventions: Checks for grammatical errors, punctuation errors, spelling mistakes, unintelligibility, discourse convention errors, and locale convention violations.
- Style: Assesses inconsistencies with external references, incorrect language register, obscure expressions, unnatural phrasing, and stylistic inconsistencies.
- Instruction Following: Evaluates adherence to task constraints, flagging wrong languages, unexecuted translation tasks, and failures to follow terminology, formatting, style, or context guidelines.

Scoring Rules. The evaluation starts from a base score of 100. The LLM judge identifies translation errors and applies deductions according to their severity. Fatal errors, such as using the wrong language or failing to execute the translation task, directly result in an overall score of 0. Major errors incur a deduction of 10–20 points per instance, while minor errors incur a deduction of 2–5 points per instance. The overall score $S_{\text{overall}}$ is obtained by subtracting all error deductions from the base score, with a lower bound of 0.

Length Penalty and Final Reward Calculation. To prevent the model from exploiting the reward system by generating pathologically short, truncated, or excessively long and redundant sentences, we introduce a length penalty. Given a source sentence $x$, a ground-truth translation of length $L_{gt}$, and a model-generated translation $y$ of length $L_{y}$, the length penalty $P_{\text{len}}$ is computed as:

$$P_{\text{len}} = \min\left(0.5 \times \frac{|L_{gt} - L_{y}|}{L_{gt}},\ 0.5\right). \tag{2}$$

The final reward $r(x, y)$ is calculated by normalizing the overall MQM score to the range $[0, 1]$, subtracting the length penalty $P_{\text{len}}$, and clipping the result at 0.
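The reward pipeline described above (pre-filter, MQM deductions, length penalty, clipping) can be sketched as follows. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, the judge's per-error deductions are passed in as plain numbers, normalization is taken to mean dividing the 0 to 100 score by 100, and the unit of length (tokens or characters) is left to the caller because the text does not specify it.

```python
def mqm_score(deductions, fatal=False):
    """Base score of 100 minus judge-assigned deductions, floored at 0; fatal errors give 0."""
    if fatal:
        return 0.0
    return max(0.0, 100.0 - sum(deductions))


def length_penalty(len_gt, len_y):
    """Equation (2): min(0.5 * |L_gt - L_y| / L_gt, 0.5). Requires len_gt > 0."""
    return min(0.5 * abs(len_gt - len_y) / len_gt, 0.5)


def final_reward(deductions, len_gt, len_y, fatal=False, degenerate=False):
    """Combine pre-filter, normalized MQM score, and length penalty; clip at 0."""
    if degenerate:  # rule-based pre-filter: severe repetition or mixed languages
        return 0.0
    normalized = mqm_score(deductions, fatal) / 100.0  # assumed normalization
    return max(0.0, normalized - length_penalty(len_gt, len_y))


# One major error (15 points) and two minor errors (3 and 2 points);
# reference length 20, output length 24.
reward = final_reward([15, 3, 2], len_gt=20, len_y=24)
```

In this example the MQM score is 80, the length penalty is 0.1, and the reward is therefore about 0.7. Because the penalty is capped at 0.5, an output of any length can still earn a positive reward only if its normalized MQM score exceeds 0.5.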

After this stage, each family branch produces a strong expert, which is then used as a strong teacher in the subsequent Cross-family OPD stage.

#### 2.2.3 Cross-family On Policy Distillation

Cross-family On Policy Distillation is the final training process of FCPT, aiming to transfer the language-family-specific translation capabilities learned by multiple family-specific strong teachers into a unified student model. Meanwhile, to improve the model's instruction-following ability, we introduce general instruction-following data in this stage and use the Hy Instruct model as the corresponding instruction teacher to provide distillation signals. Therefore, Cross-family OPD can be viewed as a unified multi-teacher distillation process.

Specifically, given an input sample $x$, we denote its teacher policy as $\pi_{T}(\cdot \mid x)$ and the output policy of the unified student model as $\pi_{\theta}(\cdot \mid x)$. In this stage, we adopt reverse KL divergence as the distillation objective, which is defined as:

$$\mathcal{L}_{\text{Cross-OPD}} = D_{\text{KL}}\Big(\pi_{\theta}(\cdot \mid x) \parallel \pi_{T_{\tau(x)}}(\cdot \mid x)\Big), \tag{3}$$

where $\pi_{\theta}(\cdot \mid x)$ denotes the output policy of the student model initialized from MT-oriented Mid-training, and $\pi_{T_{\tau(x)}}(\cdot \mid x)$ denotes the output policy of the selected teacher. For translation samples, $\tau(x)$ selects the corresponding family-specific strong teacher according to the language family of $x$; for general instruction-following samples, $\tau(x)$ selects the Hy Instruct teacher.
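The teacher routing $\tau(x)$ and the reverse KL term in Equation (3) can be illustrated with a small sketch. The sample schema (`task`, `lang`), the language-to-family mapping, and the teacher identifiers are all hypothetical; the paper names the family groups only by example and does not describe how samples are tagged.

```python
import math

# Hypothetical mapping from language code to family branch.
FAMILY_OF_LANGUAGE = {
    "fr": "western_european",
    "ja": "east_asian",
    "ar": "middle_eastern_rtl",
}


def select_teacher(sample):
    """tau(x): family teacher for translation samples, instruct teacher otherwise."""
    if sample["task"] == "translation":
        return FAMILY_OF_LANGUAGE[sample["lang"]]
    return "hy_instruct"


def reverse_kl(student, teacher, eps=1e-12):
    """D_KL(student || teacher) over a shared vocabulary."""
    return sum(s * math.log(s / max(t, eps)) for s, t in zip(student, teacher) if s > 0)


teacher_id = select_teacher({"task": "translation", "lang": "ja"})
loss = reverse_kl([0.5, 0.5], [0.9, 0.1])
```

Note the argument order: the student distribution comes first here, whereas in Equation (1) the teacher distribution comes first.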

After Cross-family On Policy Distillation, we obtain the final Hy-MT2 series models. To facilitate efficient deployment, these models are further processed in the quantization stage.

### 2.3 Quantization

To accommodate deployment requirements under varying resource constraints, we perform model quantization on the obtained Hy\-MT2 model series, offering a diverse suite of precision variants including FP16, 8\-bit, 4\-bit, 2\-bit, and 1\.25\-bit\.

For the 8\-bit and 4\-bit variants, we predominantly adopt a post\-training quantization \(PTQ\) pipeline\. Without retraining the model, this approach estimates the distribution of model weights or activations using a small set of calibration data, thereby reducing storage and computational overheads\. Specifically, 8\-bit quantization employs a higher\-precision low\-bit representation to minimize performance degradation, whereas 4\-bit quantization further compresses weight representations and mitigates quantization errors through dedicated calibration strategies\.

For the 2-bit version, we employ the ultra-low-bit quantization-aware training (QAT) scheme from the AngelSlim framework (Team, [2026](https://arxiv.org/html/2605.22064#bib.bib13)). Compared with the 8-bit and 4-bit settings handled by PTQ, 2-bit quantization imposes more stringent representational constraints. Consequently, it requires explicit simulation of low-bit quantization behaviors during the training process, allowing model weights to progressively adapt to low-precision representations. The 2-bit scheme utilizes Stretched Elastic Quantization (SEQ) (Liu et al., [2026](https://arxiv.org/html/2605.22064#bib.bib8)), quantizing each weight into one of four levels: -1.5, -0.5, +0.5, and +1.5. By optimizing the quantization mapping and scaling factors, this approach enhances the model's stability and capability for performance recovery under 2-bit constraints.
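To show what the four-level grid means in practice, here is a minimal sketch that maps weights to the nearest of the levels -1.5, -0.5, +0.5, +1.5 after dividing by a scale. It is not the SEQ algorithm itself: the single fixed scale and the nearest-level rounding are assumptions, and the learned mapping and scaling factors mentioned above are not modeled.

```python
LEVELS = (-1.5, -0.5, 0.5, 1.5)  # the four 2-bit levels named in the text


def quantize_2bit(weights, scale):
    """Return one 2-bit code (0..3) per weight: index of the nearest level to w / scale."""
    codes = []
    for w in weights:
        x = w / scale
        codes.append(min(range(len(LEVELS)), key=lambda i: abs(x - LEVELS[i])))
    return codes


def dequantize_2bit(codes, scale):
    """Map codes back to real values."""
    return [LEVELS[c] * scale for c in codes]


weights = [0.3, -0.9, 0.05, 2.0]
codes = quantize_2bit(weights, scale=0.5)
restored = dequantize_2bit(codes, scale=0.5)
```

Because the grid is symmetric and has no zero level, even a weight close to zero (0.05 here) is mapped to a nonzero value of magnitude 0.5 times the scale.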

For extreme compression scenarios, we further implement Sherry (Huang et al., [2026](https://arxiv.org/html/2605.22064#bib.bib5)), a 1.25-bit sparse ternary quantization method. Sherry quantizes model weights into the ternary space $\{-1, 0, +1\}$ and introduces a 3:4 fine-grained sparsity pattern that constrains every 4-weight block to contain exactly one zero and three sign-only weights. Such a block has 4 possible zero positions and $2^3$ sign patterns, i.e., $32 = 2^5$ configurations, so four weights can be packed into 5 bits. This achieves a regularized 1.25-bit representation while maintaining a hardware alignment that is more favorable for Single Instruction Multiple Data (SIMD) computing patterns. Compared to conventional 2-bit packing or 1.67-bit irregular packing, Sherry achieves a superior balance between compression ratio and inference efficiency.
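The 5-bit packing can be made concrete with a short sketch. A valid block is fully described by the position of its single zero (2 bits) and the signs of the three remaining weights (3 bits). The bit layout below is an assumption chosen for readability, not the storage format used by Sherry or AngelSlim.

```python
def pack_block(block):
    """Encode 4 ternary weights (exactly one 0, three +/-1) as a 5-bit integer."""
    if len(block) != 4 or sorted(abs(w) for w in block) != [0, 1, 1, 1]:
        raise ValueError("block must hold exactly one 0 and three values in {-1, +1}")
    code = block.index(0)  # bits 0-1: position of the zero
    signs = [w for w in block if w != 0]
    for i, w in enumerate(signs):  # bits 2-4: 1 means +1, 0 means -1
        if w > 0:
            code |= 1 << (2 + i)
    return code


def unpack_block(code):
    """Inverse of pack_block."""
    zero_pos = code & 0b11
    signs = iter(1 if (code >> (2 + i)) & 1 else -1 for i in range(3))
    return [0 if pos == zero_pos else next(signs) for pos in range(4)]


code = pack_block([1, 0, -1, 1])
block = unpack_block(code)
bits_per_weight = 5 / 4
```

Every integer from 0 to 31 decodes to a distinct valid block, so no code is wasted, which is where the 1.25 bits per weight comes from.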

For 1.25-bit and 2-bit QAT, we adopt a distillation-based training strategy. Specifically, the low-precision model serves as the student, while the high-precision model acts as the teacher. The overall objective combines the standard language modeling (LM) loss with a KL-divergence-based distillation loss. For the KL component, inspired by Du et al. ([2024](https://arxiv.org/html/2605.22064#bib.bib2)), we incorporate both forward and backward KL divergences. Unlike prior approaches that use a fixed, manually selected weighting coefficient, we compute the teacher model's confidence score for each token and dynamically adjust the relative weights of the forward and backward KL terms accordingly.
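The text does not give the weighting formula, so the following is only one hypothetical instantiation of a confidence-weighted KL term for a single token. It assumes that confidence is the teacher's top probability and that higher confidence shifts weight toward the forward KL; both choices are illustrative guesses, and the LM loss is omitted.

```python
import math


def kl(p, q, eps=1e-12):
    """D_KL(p || q) over a shared vocabulary."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def token_distill_loss(teacher, student):
    """Blend forward and backward KL using an assumed per-token confidence weight."""
    confidence = max(teacher)          # assumption: teacher's top probability
    forward = kl(teacher, student)     # D_KL(teacher || student)
    backward = kl(student, teacher)    # D_KL(student || teacher)
    return confidence * forward + (1.0 - confidence) * backward


confident_token = token_distill_loss([0.9, 0.05, 0.05], [0.6, 0.2, 0.2])
uncertain_token = token_distill_loss([0.4, 0.3, 0.3], [0.6, 0.2, 0.2])
```

The point of the sketch is the structure: the two KL directions are mixed with a coefficient computed per token from the teacher, rather than with one fixed hyperparameter.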

Ultimately, we obtain the quantized Hy\-MT2 model series spanning FP16, 8\-bit, 4\-bit, 2\-bit, and 1\.25\-bit precisions\. Armed with these diverse precision variants, Hy\-MT2 can be flexibly deployed across a wide range of scenarios, including high\-accuracy service serving, low\-resource device inference, and extreme edge\-side compression\.

## 3 Experiments

### 3.1 Benchmarks

To comprehensively evaluate the translation capabilities of Hy\-MT2, we construct an evaluation suite from four perspectives: general translation, real\-world business scenario translation, domain\-specific translation, and translation instruction following\.

General translation evaluation. We use FLORES-200 (Team et al., [2022](https://arxiv.org/html/2605.22064#bib.bib14)), WMT25 (Kocmi et al., [2025](https://arxiv.org/html/2605.22064#bib.bib6)), and the Mandarin⇔Minority Testset to evaluate general translation capabilities. Our FLORES-200 evaluation covers 1,056 translation directions across 33 languages (33 × 32 ordered language pairs). For WMT25, we adopt the human evaluation sets, which cover 12 translation directions. The Mandarin⇔Minority Testset focuses on bidirectional translation between Mandarin Chinese and minority languages.

Real-world business scenario evaluation. We construct WildMTBench to assess model performance on practical business inputs. The dataset covers six types of scenarios (webpages, meetings, books, social content, news, and documents), with 2,000 samples in total. It focuses on evaluating model robustness and adaptability to diverse text forms, real-world input distributions, and complex application requirements.

Domain-specific translation evaluation. We construct DomainMTBench to assess translation quality in professional domains. The dataset covers six domains: finance, law, politics, technology, medicine, and education. The data are collected from open-source corpora and processed through cleaning, filtering, domain classification, and human translation annotation, resulting in 24,000 samples. This benchmark focuses on evaluating the model's ability to handle domain terminology, specialized expressions, and established industry translations.

Instruction-following evaluation. We construct IFMTBench to evaluate translation-specific instruction following in multilingual settings. It contains 7,344 high-quality human-aligned samples with instructions in Chinese, German, Japanese, French, English, Spanish, and Korean, covering industrial translation constraints such as terminology, format, and style. It includes 4,506 single-constraint and 2,838 multi-constraint samples, evaluating both basic constraint execution and robustness under complex instruction combinations. In addition, we use IFBench (Pyatkin et al., [2025](https://arxiv.org/html/2605.22064#bib.bib11)), IFEval (Zhou et al., [2023](https://arxiv.org/html/2605.22064#bib.bib16)), MaXIFE (Liu et al., [2025](https://arxiv.org/html/2605.22064#bib.bib7)), and Multi-IF (He et al., [2024](https://arxiv.org/html/2605.22064#bib.bib4)) to assess general instruction-following capabilities. IFEval is a verifiable instruction-following benchmark for large language models, containing around 500 prompts. IFBench focuses on model generalization to diverse and unseen verifiable constraints. MaXIFE evaluates multilingual and cross-lingual instruction following across 23 languages. Multi-IF focuses on instruction following in multi-turn and multilingual interactions.

Table 2: Performance comparison on general translation benchmarks.

| Models | FLORES-200 ZH⇔XX | FLORES-200 EN⇔XX | FLORES-200 XX⇔XX | WMT25 | Mand.⇔Min. |
| --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (T) | 90.30 / 78.96 / 92.14 | 94.42 / 88.38 / 92.68 | 88.74 / 77.60 / 90.97 | 57.58 / 69.06 / 82.23 | 61.11 / 53.50 / 79.70 |
| GPT-5.5 (T) | 89.94 / 78.95 / 91.98 | 94.16 / 88.33 / 92.76 | 88.44 / 77.36 / 90.93 | 56.68 / 69.31 / 83.29 | 60.87 / 55.18 / 79.68 |
| GPT-5.5 | 89.60 / 78.96 / 91.65 | 93.98 / 88.17 / 92.54 | 87.92 / 77.20 / 90.37 | 56.41 / 69.15 / 83.41 | 59.81 / 53.32 / 75.16 |
| DeepSeek-V4-Pro (T) | 88.86 / 78.13 / 90.97 | 93.22 / 87.83 / 91.73 | 86.81 / 76.66 / 89.86 | 54.83 / 68.25 / 81.99 | 56.31 / 51.24 / 69.48 |
| DeepSeek-V4-Pro | 88.60 / 78.11 / 90.15 | 93.02 / 87.71 / 91.39 | 83.29 / 67.64 / 75.46 | 53.97 / 67.58 / 79.09 | 55.61 / 51.47 / 66.02 |
| Kimi K2.6 (T) | 88.49 / 77.97 / 90.84 | 92.95 / 87.72 / 91.28 | 86.17 / 76.72 / 89.44 | 54.48 / 68.64 / 81.68 | 55.98 / 54.76 / 68.73 |
| Kimi K2.6 | 87.68 / 77.55 / 89.85 | 91.96 / 86.95 / 90.31 | 84.05 / 75.84 / 87.43 | 49.76 / 66.08 / 76.84 | 54.29 / 52.52 / 64.00 |
| GLM5.1 (T) | 88.78 / 78.28 / 91.38 | 93.26 / 87.82 / 91.84 | 86.57 / 76.84 / 89.92 | 54.23 / 68.39 / 80.71 | 57.51 / 54.46 / 71.72 |
| GLM5.1 | 87.09 / 77.60 / 89.44 | 91.59 / 86.73 / 89.92 | 83.83 / 75.40 / 87.14 | 49.71 / 65.41 / 73.70 | 56.00 / 50.68 / 67.70 |
| Qwen3.5-397B-A17B (T) | 87.79 / 77.75 / 90.58 | 92.38 / 87.19 / 90.65 | 86.06 / 76.52 / 89.28 | 54.95 / 68.63 / 81.37 | 55.97 / 51.89 / 70.21 |
| Qwen3.5-397B-A17B | 88.50 / 78.64 / 90.66 | 93.07 / 87.73 / 91.49 | 86.29 / 76.39 / 88.87 | 55.79 / 68.80 / 83.14 | 55.59 / 54.44 / 67.45 |
| Qwen3.6-35B-A3B (T) | 87.71 / 77.93 / 90.32 | 92.32 / 87.55 / 90.49 | 84.84 / 75.95 / 88.21 | 50.75 / 65.24 / 73.42 | 54.07 / 53.37 / 64.44 |
| Gemma4-31B (T) | 89.30 / 78.86 / 91.16 | 93.79 / 88.27 / 91.90 | 87.77 / 77.28 / 90.11 | 54.48 / 67.12 / 79.59 | 51.99 / 51.49 / 64.17 |
| Gemma4-26B-A4B (T) | 88.68 / 78.68 / 90.74 | 93.31 / 88.09 / 91.53 | 86.80 / 77.14 / 89.58 | 52.13 / 65.40 / 77.73 | 47.83 / 49.19 / 56.73 |
| Gemma4-E4B (T) | 83.49 / 75.94 / 83.96 | 89.97 / 85.90 / 87.67 | 81.29 / 74.10 / 83.74 | 39.70 / 56.48 / 62.34 | 41.16 / 40.64 / 42.25 |
| Gemma4-E2B (T) | 83.21 / 76.18 / 84.21 | 88.94 / 85.38 / 86.39 | 79.69 / 73.78 / 82.54 | 37.07 / 53.51 / 57.21 | 40.44 / 40.39 / 42.67 |
| Qwen3.6-35B-A3B | 86.32 / 78.00 / 87.34 | 90.87 / 86.52 / 88.77 | 82.11 / 74.81 / 84.92 | 51.11 / 66.70 / 75.28 | 52.12 / 51.74 / 58.28 |
| Gemma4-31B | 88.40 / 78.68 / 89.89 | 93.30 / 87.96 / 91.03 | 86.84 / 76.16 / 88.67 | 52.49 / 65.66 / 75.75 | 48.98 / 50.05 / 56.75 |
| Gemma4-26B-A4B | 87.61 / 78.31 / 88.97 | 92.80 / 87.71 / 90.65 | 85.60 / 76.58 / 87.97 | 49.62 / 64.22 / 74.36 | 44.52 / 46.14 / 47.70 |
| Gemma4-E4B | 83.36 / 75.92 / 83.52 | 89.54 / 85.61 / 86.95 | 79.96 / 72.91 / 80.97 | 38.36 / 55.45 / 59.95 | 41.17 / 40.23 / 43.19 |
| Gemma4-E2B | 79.78 / 73.19 / 76.46 | 87.20 / 84.00 / 83.63 | 75.97 / 71.00 / 76.64 | 35.60 / 51.83 / 54.35 | 39.07 / 33.83 / 34.62 |
| Tower-Plus-72B | 79.69 / 71.82 / 77.86 | 84.16 / 81.13 / 78.82 | 70.02 / 65.53 / 67.85 | 41.00 / 55.54 / 54.85 | 38.55 / 35.40 / 26.70 |
| translategemma-27b-it | – / – / – | – / – / – | – / – / – | 58.02 / 66.08 / 81.24 | – / – / – |
| Microsoft-Translator | 82.95 / 72.85 / 84.79 | 89.31 / 85.48 / 87.34 | 80.13 / 72.42 / 83.85 | 42.01 / 60.21 / 67.63 | 51.80 / 41.77 / 55.27 |
| Doubao-Translator | 80.92 / 71.57 / 82.14 | 86.77 / 83.49 / 84.46 | 76.54 / 69.92 / 79.72 | 33.13 / 54.52 / 49.75 | 53.11∗ / 40.16∗ / 66.81∗ |
| iFLYTEK-Translator | 83.00 / 73.96 / – | 87.18 / 83.65 / – | 76.53 / 69.90 / – | 35.10 / 56.15 / – | 59.04 / 44.67 / – |
| HY-MT1.5-1.8B | 83.61 / 76.55 / 76.79 | 89.42 / 84.11 / 81.35 | 78.40 / 71.82 / 75.12 | 53.08 / 61.95 / 63.58 | 58.06 / 40.84 / 61.66 |
| HY-MT1.5-7B | 86.90 / 79.24 / 84.35 | 90.93 / 86.50 / 85.66 | 80.98 / 73.36 / 78.30 | 61.59 / 68.85 / 75.91 | 61.74 / 44.55 / 67.26 |
| Hy-MT2-1.8B | 84.26 / 76.43 / 82.71 | 90.00 / 85.46 / 84.05 | 79.77 / 73.41 / 78.64 | 50.30 / 64.59 / 70.36 | 58.29 / 39.22 / 62.36 |
| Hy-MT2-7B | 89.45 / 78.97 / 88.89 | 93.52 / 88.07 / 89.34 | 86.89 / 76.03 / 87.23 | 63.86 / 71.21 / 82.24 | 62.05 / 43.60 / 68.93 |
| Hy-MT2-30B-A3B | 89.83 / 79.03 / 90.26 | 93.85 / 88.23 / 90.89 | 87.47 / 76.34 / 88.79 | 62.89 / 71.08 / 84.34 | 62.44 / 42.37 / 69.43 |

Notes. Each cell reports XCOMET-XXL / CometKiwi / GEMBA scores, and all scores are multiplied by 100. T denotes thinking mode. In FLORES-200, XX⇔XX denotes the average performance over all evaluated translation directions, including ZH⇔XX and EN⇔XX. Mand.⇔Min. denotes Mandarin⇔Minority translation. Values marked with ∗ are computed only on supported language pairs. Values replaced by – indicate that the model does not support the corresponding test set. Baselines are grouped into large-scale models and all Think-mode models, medium to small-sized general models in non-Think mode, and translation-specialized models. Our models are shown in bold. The best results among large-scale models and all Think-mode models in each column are highlighted in blue background, while the best results among non-thinking-mode, small-to-medium-sized models are highlighted in light orange background.

### 3.2 General Translation Evaluation

In general translation evaluation, we use three metrics: XCOMET-XXL, CometKiwi, and GEMBA. XCOMET-XXL is a reference-based automatic evaluation metric, CometKiwi is a reference-free evaluation metric, and GEMBA is an LLM-based evaluation metric. The results are reported in Table [2](https://arxiv.org/html/2605.22064#S3.T2).

Overall, Hy-MT2 achieves substantial improvements over Hy-MT1.5 across general translation benchmarks. On the XX⇔XX setting of FLORES-200, Hy-MT2-1.8B, Hy-MT2-7B, and Hy-MT2-30B-A3B reach XCOMET-XXL scores of 79.77, 86.89, and 87.47, corresponding to 89.9%, 97.9%, and 98.6% of the 88.74 obtained by Gemini 3.1 Pro in thinking mode, respectively. In particular, Hy-MT2-7B and Hy-MT2-30B-A3B outperform strong baselines such as DeepSeek-V4-Pro, Kimi K2.6, Qwen3.5-397B-A17B, and Gemma4-26B-A4B on XCOMET-XXL in this setting. Compared with Hy-MT1.5-7B, Hy-MT2-7B improves the XCOMET-XXL score from 80.98 to 86.89, showing a clear gain in overall multilingual translation performance.
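The relative figures quoted above can be reproduced with a few lines of arithmetic. The following minimal Python sketch takes the XCOMET-XXL scores for the FLORES-200 XX⇔XX column from Table 2 and expresses each Hy-MT2 score as a percentage of the Gemini 3.1 Pro (thinking mode) score. The names `REFERENCE`, `SCORES`, and `relative_to_reference` are illustrative only and do not come from any Hy-MT2 release.

```python
# XCOMET-XXL scores on FLORES-200 XX<->XX, copied from Table 2.
REFERENCE = 88.74  # Gemini 3.1 Pro, thinking mode
SCORES = {
    "Hy-MT2-1.8B": 79.77,
    "Hy-MT2-7B": 86.89,
    "Hy-MT2-30B-A3B": 87.47,
}


def relative_to_reference(score: float, reference: float) -> float:
    """Express `score` as a percentage of `reference`, rounded to one decimal."""
    return round(100.0 * score / reference, 1)


relative = {name: relative_to_reference(s, REFERENCE) for name, s in SCORES.items()}
for name, pct in relative.items():
    print(f"{name}: {pct}% of the reference score")
```

Note that this ratio is computed on a single metric, so it says nothing about how the models compare on CometKiwi or GEMBA.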

On WMT25, Hy-MT2-7B and Hy-MT2-30B-A3B also show strong performance. Hy-MT2-7B achieves 63.86 / 71.21 / 82.24, while Hy-MT2-30B-A3B achieves 62.89 / 71.08 / 84.34. Compared with Hy-MT1.5-7B, whose scores are 61.59 / 68.85 / 75.91, Hy-MT2-7B improves on all three metrics, with a particularly large gain of 6.33 points on GEMBA. Hy-MT2-30B-A3B further achieves the best GEMBA score among all compared systems, surpassing Gemini 3.1 Pro in thinking mode (82.23) and GPT-5.5 (83.41), indicating stronger overall translation quality and readability in challenging WMT settings.

On Mandarin⇔Minority translation, Hy-MT2-7B and Hy-MT2-30B-A3B achieve XCOMET-XXL scores of 62.05 and 62.44, respectively, outperforming both Gemini 3.1 Pro in thinking mode (61.11) and Hy-MT1.5-7B (61.74) on this metric. This suggests that Hy-MT2 preserves strong performance in Mandarin-minority language translation and further improves translation quality in low-resource language scenarios.

For the lightweight setting, Hy-MT2-1.8B improves over Hy-MT1.5-1.8B on most metrics in Table 2, with the largest gains on GEMBA, and remains highly competitive against larger open-source models and commercial translation systems. Despite its small size, it outperforms Tower-Plus-72B and achieves competitive results against Microsoft Translator and Doubao Translator. On WMT25, Hy-MT2-1.8B surpasses both commercial systems across all three metrics, demonstrating a strong efficiency-quality trade-off.

### 3.3 Domain-Specific and In-the-Wild Translation Evaluation

Table 3: Performance comparison on DomainMTBench and WildMTBench. The columns Finance through Avg. belong to DomainMTBench.

| Models | Finance | Law | Medical | Technology | Politics | Education | Avg. | WildMTBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (T) | 96.30 / 94.73 | 88.48 / 93.25 | 95.64 / 95.44 | 94.94 / 95.23 | 96.16 / 94.58 | 97.24 / 95.11 | 94.50 / 94.64 | 86.62 / 88.96 |
| GPT-5.5 (T) | 96.07 / 94.69 | 88.13 / 93.08 | 95.40 / 95.35 | 94.58 / 94.89 | 95.93 / 94.47 | 97.06 / 95.08 | 94.23 / 94.51 | 86.66 / 88.72 |
| GPT-5.5 | 95.93 / 94.56 | 88.26 / 92.48 | 95.37 / 95.24 | 94.49 / 94.78 | 95.83 / 94.31 | 97.01 / 94.81 | 94.20 / 94.28 | 86.51 / 88.00 |
| DeepSeek-V4-Pro (T) | 95.76 / 94.47 | 87.65 / 92.22 | 95.33 / 95.21 | 94.40 / 94.71 | 95.55 / 94.02 | 96.96 / 94.77 | 93.96 / 94.13 | 85.77 / 88.23 |
| DeepSeek-V4-Pro | 95.81 / 94.20 | 87.55 / 91.76 | 95.42 / 95.08 | 94.33 / 94.26 | 95.61 / 93.78 | 96.72 / 94.16 | 93.96 / 93.80 | 86.05 / 85.81 |
| Kimi K2.6 (T) | 95.91 / 94.31 | 87.81 / 92.51 | 95.34 / 94.96 | 94.42 / 94.37 | 95.68 / 93.96 | 96.86 / 94.36 | 94.04 / 94.01 | 86.39 / 87.91 |
| Kimi K2.6 | 95.30 / 93.74 | 87.30 / 91.07 | 94.95 / 94.42 | 93.98 / 93.68 | 95.18 / 93.43 | 96.66 / 93.75 | 93.58 / 93.27 | 85.88 / 87.09 |
| GLM5.1 (T) | 95.77 / 94.55 | 87.72 / 92.54 | 95.08 / 95.21 | 94.40 / 94.87 | 95.60 / 94.29 | 96.82 / 94.86 | 93.92 / 94.30 | 86.10 / 88.29 |
| GLM5.1 | 95.31 / 94.06 | 86.84 / 91.76 | 94.95 / 94.85 | 93.96 / 94.18 | 95.22 / 93.71 | 96.54 / 94.38 | 93.49 / 93.72 | 85.47 / 86.83 |
| Qwen3.5-397B-A17B (T) | 94.21 / 92.76 | 86.11 / 91.01 | 93.12 / 92.86 | 92.34 / 92.41 | 94.11 / 92.48 | 94.07 / 91.85 | 92.10 / 92.20 | 82.56 / 83.54 |
| Qwen3.5-397B-A17B | 95.88 / 94.32 | 87.59 / 91.63 | 95.24 / 94.94 | 94.38 / 94.35 | 95.64 / 93.80 | 97.13 / 94.60 | 93.98 / 93.82 | 86.97 / 87.56 |
| Qwen3.6-35B-A3B (T) | 96.03 / 94.31 | 87.58 / 91.66 | 95.35 / 94.87 | 94.53 / 94.23 | 95.76 / 93.76 | 96.91 / 94.33 | 94.06 / 93.77 | 87.24 / 87.75 |
| Gemma4-31B (T) | 96.07 / 94.32 | 87.66 / 92.13 | 95.37 / 94.89 | 94.15 / 94.27 | 95.89 / 93.80 | 97.12 / 94.55 | 94.07 / 93.90 | 86.87 / 87.33 |
| Gemma4-26B-A4B (T) | 95.78 / 94.30 | 87.44 / 91.81 | 95.19 / 94.83 | 94.11 / 94.13 | 95.50 / 93.68 | 96.92 / 94.29 | 93.84 / 93.76 | 86.30 / 86.03 |
| Gemma4-E4B (T) | 93.66 / 91.60 | 83.79 / 87.03 | 93.61 / 92.19 | 92.27 / 91.02 | 93.10 / 90.43 | 95.19 / 91.86 | 91.57 / 90.53 | 84.11 / 84.64 |
| Gemma4-E2B (T) | 92.77 / 91.25 | 82.69 / 86.77 | 93.12 / 91.67 | 91.94 / 90.56 | 92.41 / 89.77 | 94.94 / 91.25 | 90.90 / 90.05 | 83.12 / 82.85 |
| Qwen3.6-35B-A3B | 95.52 / 93.53 | 87.00 / 90.13 | 94.88 / 94.26 | 94.13 / 93.55 | 95.29 / 93.27 | 96.57 / 93.57 | 93.59 / 92.96 | 86.49 / 86.63 |
| Gemma4-31B | 95.56 / 93.81 | 87.10 / 91.28 | 95.12 / 94.44 | 94.02 / 93.97 | 95.28 / 93.14 | 96.77 / 94.17 | 93.65 / 93.34 | 86.35 / 86.55 |
| Gemma4-26B-A4B | 95.39 / 93.65 | 86.63 / 90.35 | 95.02 / 94.25 | 93.69 / 93.44 | 94.97 / 93.06 | 96.69 / 94.03 | 93.39 / 93.01 | 86.26 / 86.69 |
| Gemma4-E4B | 93.45 / 91.42 | 83.46 / 86.46 | 93.51 / 92.10 | 91.95 / 90.40 | 93.13 / 90.48 | 95.13 / 91.52 | 91.41 / 90.27 | 83.45 / 84.18 |
| Gemma4-E2B | 92.07 / 89.59 | 81.52 / 84.51 | 92.53 / 89.73 | 91.03 / 88.26 | 91.61 / 87.82 | 94.30 / 90.45 | 90.09 / 88.11 | 82.01 / 81.58 |
| Tower-Plus-72B | 95.23 / 93.60 | 86.54 / 90.80 | 95.04 / 94.35 | 93.91 / 93.62 | 95.41 / 93.29 | 96.53 / 93.83 | 93.46 / 93.14 | 85.13 / 86.66 |
| TranslateGemma-27B-IT | 96.33 / 92.38 | 87.56 / 88.88 | 95.81 / 93.84 | 94.66 / 92.29 | 96.00 / 92.01 | 97.26 / 93.20 | 94.30 / 91.94 | 88.44 / 85.65 |
| Microsoft-Translator | 92.51 / 90.34 | 81.76 / 85.17 | 92.47 / 90.64 | 91.19 / 87.96 | 92.77 / 90.25 | 94.96 / 90.88 | 90.49 / 89.01 | 79.19 / 79.32 |
| Doubao-Translator | 93.39 / 91.80 | 83.18 / 87.56 | 91.63 / 90.53 | 91.78 / 89.64 | 93.77 / 91.97 | 94.69 / 91.38 | 91.02 / 90.36 | 78.07 / 77.61 |
| HY-MT1.5-1.8B | 95.55 / 90.01 | 86.03 / 84.90 | 95.26 / 90.27 | 94.37 / 88.59 | 95.23 / 89.59 | 96.61 / 90.89 | 93.52 / 88.82 | 87.41 / 80.84 |
| HY-MT1.5-7B | 96.66 / 92.82 | 88.37 / 89.52 | 96.28 / 93.40 | 95.24 / 91.52 | 96.51 / 92.55 | 97.57 / 93.09 | 94.82 / 92.04 | 90.21 / 87.13 |
| **Hy-MT2-1.8B** | 95.36 / 91.69 | 86.63 / 88.89 | 94.97 / 92.10 | 93.97 / 91.25 | 95.01 / 91.18 | 96.42 / 92.46 | 93.41 / 91.08 | 87.43 / 86.04 |
| **Hy-MT2-7B** | 96.79 / 93.14 | 89.02 / 91.14 | 96.23 / 93.76 | 95.15 / 92.80 | 96.47 / 93.02 | 97.50 / 93.43 | 94.92 / 92.79 | 90.28 / 88.93 |
| **Hy-MT2-30B-A3B** | 97.08 / 94.14 | 89.15 / 92.04 | 96.22 / 94.63 | 95.16 / 94.08 | 96.63 / 93.79 | 97.54 / 94.23 | 95.04 / 93.73 | 89.87 / 89.25 |

Notes. Each cell reports XCOMET / GEMBA scores, and all scores are multiplied by 100. Avg. denotes the overall average score on DomainMTBench. T denotes thinking mode; for models with both modes, the row without T denotes non-thinking mode. Baselines are grouped into large-scale models and all Think-mode models, medium to small-sized general models in non-Think mode, and translation-specialized models. Our models are shown in bold. The best results among large-scale models and all Think-mode models in each column are highlighted in blue background, while the best results among non-thinking-mode, small-to-medium-sized models are highlighted in light orange background.

Domain-specific translation. In domain-specific translation evaluation, we use DomainMTBench to assess model performance across multiple professional domains. We report XCOMET and GEMBA scores, where XCOMET measures reference-based translation quality and GEMBA provides an LLM-based assessment of overall translation quality. The results are shown in Table [3](https://arxiv.org/html/2605.22064#S3.T3).

Overall, Hy-MT2 shows strong performance on DomainMTBench and consistently improves over Hy-MT1.5 in GEMBA. On the average score, Hy-MT2-1.8B improves the GEMBA score from 88.82 to 91.08 compared with Hy-MT1.5-1.8B, while Hy-MT2-7B improves from 92.04 to 92.79 compared with Hy-MT1.5-7B. Hy-MT2-30B-A3B further achieves 95.04 / 93.73 on the average score, obtaining the best XCOMET result among all compared systems and the best GEMBA result among the non-thinking small-to-medium-sized models and translation-specialized models.

Across individual domains, Hy-MT2-30B-A3B achieves the best XCOMET scores in finance, law, and politics, with 97.08, 89.15, and 96.63, respectively. It also remains highly competitive in medical, technology, and education. These results indicate that the larger MoE model effectively strengthens domain-specific translation ability while maintaining strong performance across diverse professional domains.

Hy-MT2-7B also demonstrates strong domain translation quality, improving over Hy-MT1.5-7B on the average score from 94.82 / 92.04 to 94.92 / 92.79, with clear gains on GEMBA across multiple domains. The lightweight Hy-MT2-1.8B surpasses commercial systems such as Microsoft Translator and Doubao Translator on the average score, especially in GEMBA.

In-the-wild translation. On WildMTBench, Hy-MT2 also shows clear advantages in real-world translation scenarios. Hy-MT2-7B achieves 90.28 / 88.93, outperforming Hy-MT1.5-7B on both XCOMET and GEMBA. Hy-MT2-30B-A3B further reaches 89.87 / 89.25, obtaining the best GEMBA score among all compared systems and surpassing Gemini 3.1 Pro in thinking mode (88.96) in LLM-based evaluation. For the lightweight setting, Hy-MT2-1.8B improves substantially over Hy-MT1.5-1.8B in GEMBA, increasing from 80.84 to 86.04, while maintaining a similar XCOMET score. It also clearly outperforms commercial translation systems such as Microsoft Translator and Doubao Translator on WildMTBench.

Overall, these results show that Hy\-MT2 improves not only standard domain\-specific translation, but also robustness and usability in real\-world business scenarios\. The consistent gains over Hy\-MT1\.5, especially in GEMBA, suggest that Hy\-MT2 produces more natural and reliable translations under both professional\-domain and in\-the\-wild settings\.
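To make the pattern of gains concrete, the sketch below computes the per-metric differences between Hy-MT2 and Hy-MT1.5 from the (XCOMET, GEMBA) pairs in Table 3. It is only a reading aid: the dictionary and function names are hypothetical, and the numbers are copied from the DomainMTBench Avg. and WildMTBench columns.

```python
# (XCOMET, GEMBA) pairs copied from Table 3.
DOMAIN_AVG = {
    "HY-MT1.5-1.8B": (93.52, 88.82),
    "Hy-MT2-1.8B": (93.41, 91.08),
    "HY-MT1.5-7B": (94.82, 92.04),
    "Hy-MT2-7B": (94.92, 92.79),
}
WILD = {
    "HY-MT1.5-1.8B": (87.41, 80.84),
    "Hy-MT2-1.8B": (87.43, 86.04),
    "HY-MT1.5-7B": (90.21, 87.13),
    "Hy-MT2-7B": (90.28, 88.93),
}


def gain(table: dict, old: str, new: str) -> tuple:
    """Return (XCOMET delta, GEMBA delta) of `new` over `old`, rounded to 2 decimals."""
    return tuple(round(n - o, 2) for o, n in zip(table[old], table[new]))


for label, table in (("DomainMTBench Avg.", DOMAIN_AVG), ("WildMTBench", WILD)):
    for size in ("1.8B", "7B"):
        xcomet, gemba = gain(table, f"HY-MT1.5-{size}", f"Hy-MT2-{size}")
        print(f"{label} {size}: XCOMET {xcomet:+.2f}, GEMBA {gemba:+.2f}")
```

Under these numbers the XCOMET differences stay within about 0.1 points in either direction, while the GEMBA differences range from 0.75 to 5.20 points, which is why the discussion above emphasizes GEMBA.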

### 3.4 Instruction-Following Evaluation

Table 4: Performance comparison on instruction-following benchmarks.

| Models | IFBench | IFEval | MaXIFE Loose | MaXIFE Strict | MaXIFE Overall | Multi-IF | IFMTBench Simple | IFMTBench Complex | IFMTBench Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Pro (T) | 71.33 | 96.30 | 90.58 | 87.62 | 89.10 | 95.02 / 89.57 / 84.89 | 91.95 | 84.53 | 89.08 |
| GPT-5.5 (T) | 67.33 | 93.90 | 89.96 | 84.58 | 87.27 | 94.90 / 89.33 / 84.46 | 89.25 | 84.44 | 87.39 |
| GPT-5.5 | 43.33 | 91.68 | 87.04 | 81.04 | 84.04 | 93.25 / 87.21 / 82.02 | 86.97 | 83.74 | 85.72 |
| DeepSeek-V4-Pro (T) | 76.00 | 91.31 | 88.53 | 84.14 | 86.34 | 93.62 / 87.02 / 81.69 | 86.97 | 83.74 | 85.72 |
| DeepSeek-V4-Pro | 42.67 | 89.83 | 84.98 | 78.77 | 81.88 | 90.36 / 83.45 / 76.60 | 81.38 | 78.23 | 80.16 |
| Kimi K2.6 (T) | 68.67 | 95.56 | 89.19 | 85.80 | 87.50 | 93.87 / 87.15 / 82.29 | 83.17 | 80.81 | 82.26 |
| Kimi K2.6 | 38.00 | 90.02 | 86.40 | 80.49 | 83.45 | 90.60 / 84.93 / 77.48 | 88.07 | 79.92 | 84.92 |
| GLM5.1 (T) | 74.00 | 94.09 | 88.59 | 85.61 | 87.10 | 93.94 / 86.47 / 82.77 | 88.81 | 82.14 | 86.23 |
| GLM5.1 | 50.00 | 90.57 | 86.40 | 80.49 | 83.45 | 92.24 / 85.25 / 79.67 | 85.53 | 78.83 | 82.94 |
| Qwen3.5-397B-A17B (T) | 65.67 | 89.83 | 89.88 | 86.80 | 88.34 | 90.81 / 84.17 / 79.56 | 85.05 | 72.08 | 80.04 |
| Qwen3.5-397B-A17B | 48.33 | 88.54 | 86.26 | 80.46 | 83.36 | 90.50 / 81.62 / 75.86 | 81.71 | 77.61 | 80.13 |
| Gemma4-E2B | 26.00 | 80.44 | 77.32 | 68.79 | 73.06 | 80.79 / 70.59 / 63.81 | 62.93 | 49.10 | 57.59 |
| **Hy-MT2-1.8B** | 35.33 | 80.22 | 61.29 | 51.05 | 56.17 | 70.50 / 49.79 / 35.75 | 76.76 | 57.61 | 69.36 |
| Gemma4-E4B | 32.00 | 85.76 | 81.58 | 74.83 | 78.21 | 86.81 / 78.04 / 71.73 | 74.67 | 66.98 | 71.70 |
| **Hy-MT2-7B** | 35.33 | 86.14 | 76.77 | 68.73 | 72.75 | 79.53 / 66.50 / 54.35 | 89.73 | 72.67 | 83.14 |
| Qwen3.6-35B-A3B | 36.00 | 83.00 | 80.72 | 73.39 | 77.06 | 84.84 / 78.12 / 71.48 | 77.79 | 69.70 | 74.66 |
| Gemma4-26B-A4B | 45.60 | 89.80 | 84.87 | 79.25 | 82.06 | 89.17 / 81.89 / 75.29 | 83.02 | 73.18 | 79.22 |
| **Hy-MT2-30B-A3B** | 50.67 | 89.80 | 80.46 | 74.31 | 77.39 | 90.10 / 72.73 / 66.66 | 90.20 | 75.94 | 84.69 |

Notes. All scores are reported as percentages. For Multi-IF, each cell reports turn1 / turn2 / turn3 scores. T denotes thinking mode; for models with both modes, the row without T denotes non-thinking mode. IFMTBench is our translation instruction-following benchmark, and Simple, Complex, and Total correspond to single-constraint, multi-constraint, and overall scores, respectively. Baselines are grouped into ultra-large general models and medium to small-sized general models. The best results within the ultra-large models are highlighted in blue background, while the best results within each corresponding parameter-scale group of small-to-medium models are highlighted in pink background.

We evaluate instruction-following ability on both general instruction-following benchmarks and our translation-specific IFMTBench. The results are reported in Table [4](https://arxiv.org/html/2605.22064#S3.T4).

Hy\-MT2 shows strong translation\-specific instruction\-following ability on IFMTBench\. Hy\-MT2\-7B achieves 89\.73 / 72\.67 / 83\.14 on Simple, Complex, and Total, respectively, outperforming Gemma4\-E4B and other medium\-sized open\-source baselines\. Hy\-MT2\-30B\-A3B further improves to 90\.20 / 75\.94 / 84\.69, achieving the best overall IFMTBench score among small\-to\-medium\-sized models\. Compared with Qwen3\.6\-35B\-A3B and Gemma4\-26B\-A4B, Hy\-MT2\-30B\-A3B obtains consistent gains, especially on Complex instructions, showing stronger capability in handling multi\-constraint translation requests\.

The performance of Hy\-MT2 is also competitive with much larger general\-purpose models\. On IFMTBench, Hy\-MT2\-30B\-A3B approaches Kimi K2\.6, GPT\-5\.5, and Gemini 3\.1 Pro in Total score, and even surpasses several ultra\-large models on Simple instructions\. This indicates that targeted optimization for translation instruction following can effectively improve constraint understanding and execution\.

On general instruction\-following benchmarks, Hy\-MT2\-30B\-A3B maintains solid performance, reaching 89\.80 on IFEval and 77\.39 overall on MaXIFE, outperforming Qwen3\.6\-35B\-A3B on both metrics\. However, its Multi\-IF scores are lower in later turns than some comparable baselines, suggesting that the main advantage of Hy\-MT2 lies in translation\-specific instruction following rather than general multi\-turn instruction following\.
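The later-turn weakness on Multi-IF can be quantified directly from Table 4. The sketch below, with hypothetical names, computes how many points each similarly sized model loses between turn 1 and turn 3.

```python
# Multi-IF (turn1, turn2, turn3) scores copied from Table 4.
MULTI_IF = {
    "Qwen3.6-35B-A3B": (84.84, 78.12, 71.48),
    "Gemma4-26B-A4B": (89.17, 81.89, 75.29),
    "Hy-MT2-30B-A3B": (90.10, 72.73, 66.66),
}


def turn_drop(scores: tuple) -> float:
    """Points lost between the first and the last turn, rounded to 2 decimals."""
    first, *_, last = scores
    return round(first - last, 2)


drops = {name: turn_drop(scores) for name, scores in MULTI_IF.items()}
for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: -{drop:.2f} points from turn 1 to turn 3")
```

Hy-MT2-30B-A3B starts with the highest turn-1 score of the three but loses 23.44 points by turn 3, compared with 13.36 and 13.88 for the two general-purpose baselines.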

### 3.5 Quantization Experiment

We evaluate the model size and performance of various quantized Hy-MT2 models across general translation, domain-specific translation, and instruction-following benchmarks, as shown in Table [5](https://arxiv.org/html/2605.22064#S3.T5). Overall, quantization substantially reduces deployment cost while preserving strong translation quality. For Hy-MT2-1.8B and Hy-MT2-7B, FP8 achieves performance very close to BF16 across most benchmarks, indicating that low-precision inference can be applied with minimal quality degradation. Q4\_K\_M also maintains competitive performance, especially on FLORES-200 and domain benchmarks, though larger drops appear on instruction-following tasks such as IFMTBench, where Hy-MT2-1.8B falls from 69.36 to 63.47 and Hy-MT2-7B from 83.14 to 75.11. The 2-bit Hy-MT2-1.8B shows a more noticeable decline, suggesting that extreme quantization introduces a stronger trade-off between efficiency and accuracy. For the MoE-based Hy-MT2-30B-A3B, FP8 remains highly stable compared with BF16, which is consistent with larger architectures being more robust to quantization. These results show that Hy-MT2 provides flexible deployment options, from highly efficient compact models to high-quality large models, while maintaining strong multilingual translation capability.

Table 5: Performance comparison across quantized model variants.

| Model | FLORES-200 ZH⇔XX | FLORES-200 EN⇔XX | FLORES-200 XX⇔XX | WMT25 | Mand | DMTB | WMTB | IFBench | IFEval | IFMTB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hy-MT2-1.8B-BF16 | 83.49 | 87.02 | 79.21 | 60.33 | 60.32 | 92.25 | 86.74 | 35.33 | 80.22 | 69.36 |
| Hy-MT2-1.8B-FP8 | 83.11 | 86.66 | 78.63 | 59.51 | 59.73 | 92.15 | 86.57 | 35.00 | 80.59 | 67.06 |
| Hy-MT2-1.8B-Q4\_K\_M | 82.22 | 85.87 | 77.19 | 57.46 | 57.23 | 91.85 | 86.09 | 33.33 | 78.93 | 63.47 |
| Hy-MT2-1.8B-2bit | 80.86 | 84.74 | 76.31 | 57.96 | 58.96 | 89.67 | 83.33 | 34.00 | 77.63 | 58.99 |
| Hy-MT2-7B-BF16 | 89.17 | 91.43 | 87.06 | 73.05 | 65.49 | 93.86 | 89.61 | 35.33 | 86.14 | 83.14 |
| Hy-MT2-7B-FP8 | 88.92 | 91.45 | 86.92 | 72.59 | 65.50 | 93.84 | 89.49 | 34.00 | 86.51 | 82.38 |
| Hy-MT2-7B-Q4\_K\_M | 88.96 | 91.46 | 86.90 | 72.30 | 65.57 | 93.81 | 89.41 | 37.67 | 85.58 | 75.11 |
| Hy-MT2-30B-A3B-BF16 | 90.04 | 92.37 | 88.13 | 73.62 | 65.93 | 94.39 | 89.56 | 50.67 | 89.80 | 84.69 |
| Hy-MT2-30B-A3B-FP8 | 89.97 | 91.87 | 88.01 | 73.38 | 65.91 | 94.35 | 89.61 | 50.33 | 88.70 | 83.47 |

Notes. All scores are reported as percentages. DMTB, WMTB, and IFMTB denote DomainMTBench, WildMTBench, and IFMTBench (our translation instruction-following benchmark), respectively. Mand denotes Mandarin-minority language bidirectional translation.
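One way to read Table 5 is as a retention ratio, that is, the quantized score divided by the BF16 score. The sketch below does this for Hy-MT2-1.8B on one translation column and one instruction-following column. The helper name `retention` and the choice of columns are illustrative assumptions, not part of the paper's evaluation protocol.

```python
# Hy-MT2-1.8B scores copied from Table 5.
FLORES_XX = {"BF16": 79.21, "FP8": 78.63, "Q4_K_M": 77.19, "2bit": 76.31}
IFMTB = {"BF16": 69.36, "FP8": 67.06, "Q4_K_M": 63.47, "2bit": 58.99}


def retention(scores: dict, baseline: str = "BF16") -> dict:
    """Return each score as a percentage of the baseline score (one decimal)."""
    base = scores[baseline]
    return {fmt: round(100.0 * value / base, 1) for fmt, value in scores.items()}


flores_retention = retention(FLORES_XX)
ifmtb_retention = retention(IFMTB)
for fmt in FLORES_XX:
    print(f"{fmt}: FLORES-200 XX<->XX {flores_retention[fmt]}%, IFMTB {ifmtb_retention[fmt]}%")
```

For the 2-bit variant this gives roughly 96% retention on FLORES-200 XX⇔XX but only about 85% on IFMTBench, matching the observation that instruction following degrades faster than translation quality under aggressive quantization.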

### 3.6 Case Study on Translation Instruction Following

![Refer to caption](https://arxiv.org/html/2605.22064v1/x3.png)

Figure 3: Case study of Hy-MT2 on translation instruction following (Part 1).

![Refer to caption](https://arxiv.org/html/2605.22064v1/x4.png)

Figure 4: Case study of Hy-MT2 on translation instruction following (Part 2).

![Refer to caption](https://arxiv.org/html/2605.22064v1/x5.png)

Figure 5: Case study of Hy-MT2 on translation instruction following (Part 3).

Figures [3](https://arxiv.org/html/2605.22064#S3.F3) to [5](https://arxiv.org/html/2605.22064#S3.F5) illustrate representative examples of Hy-MT2 handling translation-specific instructions. These cases encompass common real-world requirements, including style control, delimiter preservation, legal redaction, idiom explanation, subtitle translation with cultural preservation, legal-document anonymization, and language-learning-oriented annotation.

The examples demonstrate that Hy\-MT2 can handle both explicit translation constraints and compound user requirements\. In the style\-control case, the model rewrites a modern software update log into classical Chinese with a formal, memorial\-like tone\. For delimiter preservation, it accurately maintains special markers \(e\.g\., “\#\#”\) while translating surrounding content, showing precise formatting awareness\.
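A delimiter-preservation constraint of this kind lends itself to a simple automatic check. The sketch below is a hypothetical checker, not the scoring procedure used in the paper: it only verifies that a marker such as "##" occurs the same number of times in the source and in the translation. The example strings are made up for illustration.

```python
def delimiters_preserved(source: str, translation: str, marker: str = "##") -> bool:
    """Return True if `marker` occurs equally often in both texts.

    This is a necessary condition only: it does not check that the
    translated segments between markers are in the right order or correct.
    """
    return source.count(marker) == translation.count(marker)


# Made-up example inputs.
source = "更新日志##修复了登录问题##优化了性能"
good = "Changelog##Fixed the login issue##Improved performance"
bad = "Changelog: fixed the login issue and improved performance"

print(delimiters_preserved(source, good))  # markers kept
print(delimiters_preserved(source, bad))   # markers dropped
```

A stricter variant could split both texts on the marker and compare the number of non-empty segments, at the cost of more assumptions about how the markers are used.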

Hy\-MT2 also excels in more complex scenarios\. In legal redaction and anonymization tasks, it translates while applying required redactions, paragraph restructuring, and numbering rules\. In idiom\-explanation and language\-learning cases, it provides not only translations but also furigana, romaji output, and explanatory notes, meeting diverse instructional requirements\.

Moreover, in subtitle translation, Hy\-MT2 preserves culturally specific elements such as honorifics while following SRT formatting rules, reflecting its ability to manage multilingual instructions and maintain cultural fidelity\.
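Following SRT formatting rules means that cue numbers and timing lines must survive translation unchanged while only the subtitle text changes. As a minimal sketch of how this could be verified, the hypothetical helper below extracts the (index, timing) skeleton of an SRT string and compares it between source and translation. It assumes the common SRT layout of a numeric index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, and one or more text lines per cue. The subtitle content is invented for illustration.

```python
import re

# Timing line of a standard SRT cue, e.g. "00:00:01,000 --> 00:00:03,000".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def srt_skeleton(srt: str) -> list:
    """Return the (index, timing) pair of every well-formed cue in an SRT string."""
    cues = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [line.strip() for line in block.strip().splitlines()]
        if len(lines) >= 3 and lines[0].isdigit() and TIMING.match(lines[1]):
            cues.append((lines[0], lines[1]))
    return cues


def srt_structure_preserved(source: str, translation: str) -> bool:
    """True if both texts contain the same non-empty sequence of cue skeletons."""
    skeleton = srt_skeleton(source)
    return bool(skeleton) and skeleton == srt_skeleton(translation)


# Invented example: the honorific "-san" is kept in the translation.
source_srt = (
    "1\n00:00:01,000 --> 00:00:03,000\n田中さん、おはようございます。\n\n"
    "2\n00:00:03,500 --> 00:00:05,000\nおはよう。\n"
)
translated_srt = (
    "1\n00:00:01,000 --> 00:00:03,000\nGood morning, Tanaka-san.\n\n"
    "2\n00:00:03,500 --> 00:00:05,000\nMorning.\n"
)
print(srt_structure_preserved(source_srt, translated_srt))
```

The check deliberately ignores the subtitle text itself, so judging whether honorifics and other cultural elements were preserved still requires a separate evaluation.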

Overall, these examples confirm that Hy\-MT2 reliably executes diverse translation instructions, accommodating constraints on language, style, format, cultural context, and auxiliary explanatory needs\.

## 4 Conclusion

In this paper, we present Hy-MT2, a multilingual machine translation model family designed for real-world translation scenarios. Hy-MT2 covers both dense and mixture-of-experts architectures, including Hy-MT2-1.8B, Hy-MT2-7B, and Hy-MT2-30B-A3B, all supporting translation among 33 languages. Compared with Hy-MT1.5, Hy-MT2 provides systematic improvements in domain-specific translation, real-world scenario translation, translation instruction following, model scaling, and efficient on-device deployment. Hy-MT2-7B and Hy-MT2-30B-A3B outperform strong open-source baselines such as DeepSeek-V4-Pro and Kimi K2.6, and achieve performance close to or even surpassing leading closed-source models such as Gemini 3.1 Pro on multiple benchmarks. The lightweight Hy-MT2-1.8B also demonstrates strong small-model translation capability, outperforming several commercial translation APIs. To support diverse deployment scenarios, Hy-MT2 is released in multiple precision formats, including 1.25-bit, 2-bit, 4-bit, 8-bit, and FP16. Among them, the 1.25-bit and 2-bit versions are built on quantization techniques developed in-house by Hunyuan, significantly reducing model resource consumption while improving inference efficiency. Overall, Hy-MT2 provides a high-quality, efficient, and multi-capability multilingual translation model family for real-world applications.
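For intuition about what these precision formats mean for storage, the following back-of-the-envelope sketch estimates the weights-only size of a 1.8B-parameter model at each bit width. It rests on loud assumptions: exactly 1.8 × 10^9 parameters, every weight stored at the stated width, 1 MB = 10^6 bytes, and no overhead for quantization scales or metadata. The function name is hypothetical.

```python
def weight_storage_mb(num_params: float, bits_per_weight: float) -> float:
    """Weights-only size in megabytes (1 MB = 10**6 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumption: "1.8B" is taken to mean exactly 1.8e9 parameters.
NUM_PARAMS = 1.8e9
sizes = {bits: weight_storage_mb(NUM_PARAMS, bits) for bits in (1.25, 2, 4, 8, 16)}
for bits, size in sizes.items():
    print(f"{bits}-bit: about {size:.0f} MB")
```

Under these assumptions the 1.25-bit estimate is about 281 MB, below the 440 MB reported in the abstract for the 1.25-bit Hy-MT2-1.8B. The gap plausibly comes from tensors kept at higher precision and from quantization metadata, but that explanation is a guess rather than something stated in the paper.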

## 5 Contributions

### 5.1 Core Contributors

Mao Zheng, Zheng Li, Tao Chen, Bo Lv, Mingrui Sun, Mingyang Song, Jinlong Song, Hong Huang, Decheng Wu, Hai Wang, Yifan Song, Yanfeng Chen, Guanwei Zhang

### 5.2 Contributors

Guanghua Yu, Yi Su, Hong Liu, Jinxiang Ou, Keyao Wang, Weile Chen, Haozhao Kuang, Kai Wang, Nuo Chen, Zihao Zheng, Chenhao Wang, Bin Xing, Chengcheng Xu, Tinghao Yu, Binghong Wu, Long Xu, Jiacheng Shi, Yunhao Wang, Baifang Chen, Lei Zhang, Qi Yang, Zhao Wu, Jiacheng Li, Lan Jiang, Lanrui Wang, Kai Zhang, Shuaipeng Li, Zhongzhi Chen, Weixuan Sun, Jiaqi Zhu, An Wang, Wei Li, Jun Xia, Weidong Han, Wutian Yang, Litong Hui, Luoguo Jia, Jiajia Wu, Hongchuan Zeng, Xinpeng Zhou, Tianxiang Fei

## References

- DeepMind (2025) DeepMind. Introducing gemini 3. [https://blog.google/products/gemini/gemini-3-collection/](https://blog.google/products/gemini/gemini-3-collection/), 2025. Accessed: 2025-12-29.
- Du et al. (2024) Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. Bitdistiller: Unleashing the potential of sub-4-bit llms via self-distillation. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 102–116, 2024.
- Freitag et al. (2021) Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, George Foster, Alon Lavie, and Ondřej Bojar. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Loic Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, Andre Martins, Makoto Morishita, and Christof Monz (eds.), *Proceedings of the Sixth Conference on Machine Translation*, pp. 733–774, Online, November 2021. Association for Computational Linguistics. URL [https://aclanthology.org/2021.wmt-1.73/](https://aclanthology.org/2021.wmt-1.73/).
- He et al. (2024) Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, et al. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following. *arXiv preprint arXiv:2410.15553*, 2024.
- Huang et al. (2026) Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, and Dapeng Wu. Sherry: Hardware-efficient 1.25-bit ternary quantization via fine-grained sparsification. *arXiv preprint arXiv:2601.07892*, 2026.
- Kocmi et al. (2025) Tom Kocmi, Ekaterina Artemova, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Konstantin Dranch, Anton Dvorkovich, Sergey Dukanov, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Marzena Karpinska, Philipp Koehn, Howard Lakougna, Jessica Lundin, Christof Monz, Kenton Murray, Masaaki Nagata, Stefano Perrella, Lorenzo Proietti, Martin Popel, Maja Popović, Parker Riley, Mariya Shmatova, Steinthór Steingrímsson, Lisa Yankovskaya, and Vilém Zouhar. Findings of the WMT25 general machine translation shared task: Time to stop evaluating on easy test sets. In Barry Haddow, Tom Kocmi, Philipp Koehn, and Christof Monz (eds.), *Proceedings of the Tenth Conference on Machine Translation*, pp. 355–413, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-341-8. doi:10.18653/v1/2025.wmt-1.22. URL [https://aclanthology.org/2025.wmt-1.22/](https://aclanthology.org/2025.wmt-1.22/).
- Liu et al. (2025) Yile Liu, Ziwei Ma, Xiu Jiang, Jinglu Hu, ChangJing ChangJing, and Liang Li. MaXIFE: Multilingual and cross-lingual instruction following evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (eds.), *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 14252–14332, Vienna, Austria, July 2025. Association for Computational Linguistics. ISBN 979-8-89176-251-0. URL [https://aclanthology.org/2025.acl-long.698/](https://aclanthology.org/2025.acl-long.698/).
- Liu et al. (2026) Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, et al. Paretoq: Improving scaling laws in extremely low-bit llm quantization. *Advances in Neural Information Processing Systems*, 38:91311–91336, 2026.
- Lommel et al. (2014) Arle Lommel, Aljoscha Burchardt, Maja Popović, Kim Harris, Eleftherios Avramidis, and Hans Uszkoreit. Using a new analytic measure for the annotation and analysis of MT errors on real data. In Mauro Cettolo, Marcello Federico, Lucia Specia, and Andy Way (eds.), *Proceedings of the 17th Annual Conference of the European Association for Machine Translation*, pp. 165–172, Dubrovnik, Croatia, June 16-18 2014. European Association for Machine Translation. URL [https://aclanthology.org/2014.eamt-1.38/](https://aclanthology.org/2014.eamt-1.38/).
- OpenAI (2026) OpenAI. Openai gpt-5 system card, 2026. URL [https://arxiv.org/abs/2601.03267](https://arxiv.org/abs/2601.03267).
- Pyatkin et al. (2025) Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi. Generalizing verifiable instruction following, 2025.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. *CoRR*, abs/2402.03300, 2024. doi:10.48550/ARXIV.2402.03300. URL [https://doi.org/10.48550/arXiv.2402.03300](https://doi.org/10.48550/arXiv.2402.03300).
- Team (2026) Hunyuan AI Infra Team. Angelslim: A more accessible, comprehensive, and efficient toolkit for large model compression. *arXiv preprint arXiv:2602.21233*, 2026.
- Team et al. (2022) NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind: Scaling human-centered machine translation, 2022. URL [https://arxiv.org/abs/2207.04672](https://arxiv.org/abs/2207.04672).
- Zheng et al. (2025) Mao Zheng, Zheng Li, Tao Chen, Mingyang Song, and Di Wang. Hy-mt1.5 technical report, 2025. URL [https://arxiv.org/abs/2512.24092](https://arxiv.org/abs/2512.24092).
- Zhou et al. (2023) Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models. *arXiv preprint arXiv:2311.07911*, 2023.