ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation


Summary

ReflectMT introduces a two-stage RL method that trains LRMs to internalize reflection, enabling single-pass high-quality translation with 94% fewer tokens than multi-step reasoning models like DeepSeek-R1.

arXiv:2604.19144v1 Announce Type: new Abstract: Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a "think-first-then-translate" paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a "translate-first-think-later" paradigm. Our approach develops the model's "translate-reflect-refine" capability through reinforcement learning. In the first stage, we cultivate the model's capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model's first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.

# ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
Source: [https://arxiv.org/html/2604.19144](https://arxiv.org/html/2604.19144)
Kunquan Li1, Yingxue Zhang2, Fandong Meng2, Jinsong Su1

1School of Informatics, Xiamen University, China; 2WeChat AI, Tencent Inc, China

likunquan@stu.xmu.edu.cn, jssu@xmu.edu.cn, {yxuezhang,fandongmeng}@tencent.com

###### Abstract

Recent years have witnessed growing interest in applying Large Reasoning Models \(LRMs\) to Machine Translation \(MT\)\. Existing approaches predominantly adopt a “think\-first\-then\-translate” paradigm\. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency\. To address these limitations, we propose ReflectMT, a two\-stage reflection internalization algorithm for machine translation that employs a “translate\-first\-think\-later” paradigm\. Our approach develops the model’s “translate–reflect–refine” capability through reinforcement learning\. In the first stage, we cultivate the model’s capacity for high\-quality reflection and refinement, thereby enhancing its semantic comprehension and task\-specific knowledge\. In the second stage, we train the model to internalize the knowledge acquired during reflection\. As a result, during inference, ReflectMT operates in a direct translation mode, producing high\-quality translations on the first attempt without any explicit reasoning steps\. Experimental results on datasets such as WMT24 demonstrate that our model’s first\-pass translations during inference outperform multi\-step reasoning LRMs such as DeepSeek\-R1 in both automatic metrics and GPT\-based evaluation, achieving a 2\.16\-point improvement in GPT\-based translation quality evaluation while reducing token consumption by 94\.33%\.

Note: Work was done when Kunquan Li was interning at WeChat AI, Tencent Inc, China.

![Refer to caption](https://arxiv.org/html/2604.19144v1/x1.png)

Figure 1: Training vs. Inference Paradigm of ReflectMT. During training, the model generates the complete reflective translation process, including initial translation (in black), reflection analysis, revision decision, and final translation (in gray). During inference, while the model retains full reflective capabilities, we employ an early stopping strategy to output only the initial translation (in black).

## 1 Introduction

In recent years, Large Reasoning Models (LRMs), such as OpenAI-o1 (OpenAI, [2024](https://arxiv.org/html/2604.19144#bib.bib2)) and DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2604.19144#bib.bib1)), have demonstrated powerful capabilities in complex tasks such as mathematics, programming, and logical reasoning. These models typically leverage long Chain-of-Thought (CoT) processes, exploring different solution paths through reasoning, self-verification, and iterative refinement before generating the final answer.

Inspired by these advances, researchers have begun exploring the introduction of long CoT into Machine Translation (MT). We categorize these methods as the “pre-thinking” paradigm, wherein the model performs explicit reasoning before generating the translation. For instance, Marco-o1 (Zhao et al., [2024](https://arxiv.org/html/2604.19144#bib.bib31)) preliminarily validated the effectiveness of long CoT in slang translation with brief examples; Wang et al. ([2024a](https://arxiv.org/html/2604.19144#bib.bib30)) trained the DRT model through supervised fine-tuning, demonstrating the potential of long CoT in literary translation; ExTrans (Wang et al., [2025b](https://arxiv.org/html/2604.19144#bib.bib16)) further proposed an exemplar-enhanced reinforcement learning framework, achieving synergistic improvement of reasoning chains and translation quality. However, although the explicit reasoning process significantly enhances the model’s ability to handle difficult translation tasks, the “pre-thinking” paradigm requires the model to generate explicit and lengthy chains of thought during the inference stage, resulting in substantial computational overhead and latency, which limits its application.

To address the above limitations, this paper proposes a novel “post\-thinking” algorithm for machine translation, named ReflectMT\. Unlike the “think\-first\-then\-translate” paradigm, our core motivation is to leverage the reflection mechanism to enhance model capabilities during training, enabling the direct generation of high\-quality translations during inference without incurring additional computational costs\. Specifically, we adopt a “generate\-reflect\-refine” workflow during the training stage: the model first generates an initial translation, then initiates a structured reflection process to critically evaluate it, and finally generates a refined translation\. Different from free\-form reflection in general tasks, we design a multi\-dimensional structured reflection process tailored to the specific needs of MT \(covering error identification, ambiguity analysis, etc\.\) and construct a high\-quality reflection dataset through multi\-agent collaboration\.

Furthermore, to achieve internalization of reflection capabilities, we propose a two\-stage progressive training strategy based on Reinforcement Learning \(RL\)\. This strategy mirrors the skill development of human translators: novices rely on explicit reflection and revision, whereas accumulated experience allows these processes to be internalized, enabling the direct generation of high\-quality translations\. In the first stage \(reflection capability establishment\), we guide the model to learn the complete “translate\-reflect\-refine” process through carefully designed reward functions\. In the second stage \(capability internalization and transfer\), we guide the model to integrate the acquired reflection knowledge into the initial translation process by reinforcing the reward signal for direct translation\. This training paradigm achieves the transformation from “explicit reflection” to “implicit capability”, enabling the model to directly output high\-quality translations during inference\.

Experimental results on multiple datasets, including WMT24, demonstrate that our method outperforms representative strong baseline models, including DeepSeek\-R1, in both automatic metrics and GPT\-based evaluation\. Notably, under the setting where explicit reflection is not performed during the inference stage, our model achieves a 2\.16\-point improvement in GPT\-based evaluation compared to the version executing the complete reflection process, while reducing token consumption by 94\.33%, truly achieving dual enhancement of translation efficiency and quality\.

The main contributions of this paper are summarized as follows:

![Refer to caption](https://arxiv.org/html/2604.19144v1/x2.png)

Figure 2: Overview of the Reflection Internalization Framework. (a) Data Construction: A multi-agent system synthesizes training data with reflection chains, including initial translations and refinements. (b) Reinforcement Learning Strategy: Stage 1 establishes structured reflection and optimization capabilities, while Stage 2 internalizes reflection knowledge for effective first-pass translation during inference.

- We introduce a “post-thinking” reflection mechanism into neural machine translation, designing a multi-dimensional structured reflection process tailored to MT tasks that enables systematic evaluation and iterative improvement of translations.
- We propose a reflection internalization training algorithm based on RL, which internalizes reflection knowledge into the model’s direct generation capability through a two-stage training strategy, significantly improving translation quality while eliminating additional computational overhead during inference.
- We construct a high-quality MT reflection dataset via multi-agent collaboration, providing a robust data foundation for learning complete reasoning chains. Extensive ablation studies on the dataset validate the effectiveness of our proposed framework.

## 2 Methodology

### 2.1 Overview

We present our Reflection Internalization framework, which comprises three core modules: (1) constructing training data with complete reflection chains via multi-agent collaboration (Section [2.2](https://arxiv.org/html/2604.19144#S2.SS2)); (2) designing multi-dimensional reward functions to jointly optimize format compliance, translation quality, reflection quality, and translation improvement (Section [2.3.1](https://arxiv.org/html/2604.19144#S2.SS3.SSS1)); and (3) employing a two-stage Reinforcement Learning (RL) strategy to first establish and then internalize reflection capabilities (Sections [2.3.2](https://arxiv.org/html/2604.19144#S2.SS3.SSS2) and [2.3.3](https://arxiv.org/html/2604.19144#S2.SS3.SSS3)). The ultimate objective of our framework is to leverage explicit reflection steps during training to build cognitive capabilities, and subsequently internalize them, enabling the model to generate high-quality translations in a single forward pass during inference without reasoning overhead.

### 2.2 Data Construction

High\-quality reflective data is crucial for equipping models with explicit reasoning capabilities\. However, existing machine translation datasets typically only contain parallel corpora and lack intermediate reasoning steps\. To address this issue, we designed an iterative multi\-agent collaboration system that automatically constructs datasets containing complete reflection chains through dialogues between the Translator and the Reflector\.

As illustrated in Figure [2](https://arxiv.org/html/2604.19144#S1.F2), for each pre-collected sentence (denoted as $x$), we adopt a dual-agent framework for translation. The synthesis process is as follows:

(1) Initial Translation: The Translator generates an initial translation of the source sentence $x$, denoted as $y_0 = \text{Translator}(x)$.

(2) Reflective Evaluation: The Reflector critically evaluates $y_0$ to provide a multi-dimensional assessment. This includes a quality score $r_0 = \text{Reflector\_score}(x, y_0) \in [0, 100]$ based on predefined criteria (such as semantic accuracy, cultural adaptability, and fluency), alongside structured refinement suggestions $f_0 = \text{Reflector\_suggest}(x, y_0, r_0)$ that target specific lexical and structural deficiencies.

(3) Iterative Refinement Loop: The agents collaboratively optimize the translation through a feedback loop. At each iteration $k$ ($k \geq 1$), the Translator generates an updated translation $y_k = \text{Translator}(x, y_{k-1}, f_{k-1}, r_{k-1})$. The Reflector then returns a new score $r_k$ and updated suggestions $f_k$. This loop terminates when the score reaches a satisfactory threshold ($r_k \geq \theta$) or the maximum iteration limit is reached ($k \geq K_{\max}$). The sequence $\{(y_k, r_k, f_k)\}_{k=0}^{K}$ forms a complete reflective translation chain.
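The three steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `translate` and `reflect` callables, the defaults `theta=90.0` and `k_max=3`, and the toy agents are all hypothetical stand-ins (the actual settings are deferred to Appendix A).

```python
from typing import Callable, List, Optional, Tuple

# One step of the chain: (translation y_k, score r_k, suggestions f_k).
Step = Tuple[str, float, str]


def build_reflection_chain(
    x: str,
    translate: Callable[[str, Optional[str], Optional[str], Optional[float]], str],
    reflect: Callable[[str, str], Tuple[float, str]],
    theta: float = 90.0,  # hypothetical score threshold
    k_max: int = 3,       # hypothetical iteration cap
) -> List[Step]:
    """Run the Translator/Reflector dialogue and return [(y_k, r_k, f_k)]."""
    y = translate(x, None, None, None)  # y_0 = Translator(x)
    r, f = reflect(x, y)                # r_0 and f_0
    chain: List[Step] = [(y, r, f)]
    k = 0
    # Stop once r_k >= theta or k >= k_max, as in the termination rule.
    while r < theta and k < k_max:
        k += 1
        y = translate(x, y, f, r)       # y_k from (x, y_{k-1}, f_{k-1}, r_{k-1})
        r, f = reflect(x, y)
        chain.append((y, r, f))
    return chain


# Toy agents, for illustration only: each revision appends "+" and the
# score rises by 15 points per revision.
def toy_translate(x, prev, feedback, score):
    return x.upper() if prev is None else prev + "+"


def toy_reflect(x, y):
    score = 70.0 + 15.0 * y.count("+")
    return score, "ok" if score >= 90.0 else "tighten wording"


chain = build_reflection_chain("hello", toy_translate, toy_reflect)
```

With these toy agents the loop stops after two revisions, because the third score crosses the assumed threshold before the iteration cap is reached.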

Data filtering and hyperparameter settings are detailed in Appendix[A](https://arxiv.org/html/2604.19144#A1)\.

### 2.3 Progressive Reflection Internalization

Our training paradigm is inspired by the cognitive development of professional human translators: the ability to produce high\-quality first\-pass translations stems from internalizing long\-accumulated “deliberate reflection” experiences into intuitive competence\. Based on this insight, we design a two\-stage progressive Reinforcement Learning \(RL\) strategy\. Stage 1 trains the model to master the explicit “translate\-reflect\-refine” pipeline, while Stage 2 shifts the focus to the initial translation, forcing the model to internalize the acquired reflection knowledge into its first\-pass generation\.

#### 2.3.1 Reward Modeling

To implement our RL algorithm, we design four reward components. Given a source sentence $x$, we require the model to generate outputs adhering to a strictly defined structural template:

<answer\> $y_{\text{init}}$ </answer\> <reflection\> $f_{\text{refl}}$ </reflection\> <need\_revision\> $v_{\text{rev}}$ </need\_revision\> <final\_answer\> $y_{\text{fin}}$ </final\_answer\>.

We employ DeepSeek-V3 (DeepSeek-AI et al., [2025](https://arxiv.org/html/2604.19144#bib.bib9)) as an LLM-as-a-Judge, denoted as $\mathcal{J}_{\text{v3}}(\cdot)$, to score texts on a 0-100 scale.

Format Reward ($r_{\text{form}}$). We use regular expressions to verify whether the output strictly contains all required XML tags in the correct order.

$$r_{\text{form}} = \begin{cases} 1 & \text{if format is correct} \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
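A format check of this kind might look like the sketch below. The exact regular expression is not given in the paper; the pattern here is an assumption that requires the four tag pairs to appear in order, with only whitespace between consecutive blocks, and the function name `format_reward` is illustrative.

```python
import re

# Assumed pattern: the four tag pairs in order, separated only by whitespace.
_TEMPLATE = re.compile(
    r"\s*<answer>(.+?)</answer>\s*"
    r"<reflection>(.+?)</reflection>\s*"
    r"<need_revision>(.+?)</need_revision>\s*"
    r"<final_answer>(.+?)</final_answer>\s*",
    re.DOTALL,  # let "." span newlines inside each block
)


def format_reward(output: str) -> int:
    """Return 1 if the whole output matches the structural template, else 0."""
    return 1 if _TEMPLATE.fullmatch(output) else 0


good = (
    "<answer>你好</answer>\n"
    "<reflection>Accurate and fluent.</reflection>\n"
    "<need_revision>no</need_revision>\n"
    "<final_answer>你好</final_answer>"
)
bad = "<answer>你好</answer><final_answer>你好</final_answer>"
```

Using `fullmatch` rather than `search` means stray text before or after the template also yields a reward of 0, which is one reasonable reading of "strictly contains".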
Reflection Quality Reward ($r_{\text{refl}}$). We evaluate the reflection content $f_{\text{refl}}$ based on the accuracy of problem identification and the actionability of the suggestions:

$$r_{\text{refl}} = \frac{\mathcal{J}_{\text{v3}}(x, y_{\text{init}}, f_{\text{refl}})}{100} \tag{2}$$

Translation Quality Scores ($s_{\text{init}}$ and $s_{\text{fin}}$). We score the quality (e.g., semantic accuracy, fluency) of both the initial and final translations: $s_{\text{init}} = \mathcal{J}_{\text{v3}}(x, y_{\text{init}})$ and $s_{\text{fin}} = \mathcal{J}_{\text{v3}}(x, y_{\text{fin}})$. To facilitate a smooth transition from explicit reflection to implicit internalization, the formulation of the Translation Quality Reward ($r_{\text{trans}}$) dynamically adapts according to the training stage (detailed in Sections 2.3.2 and 2.3.3).

Reflection Improvement Reward ($r_{\text{imp}}$). To encourage meaningful refinement, we define the score difference $\Delta s = s_{\text{fin}} - s_{\text{init}}$ and design a piecewise reward:

$$r_{\text{imp}} = \begin{cases} 1 & \text{if } \Delta s \geq \eta \\ \mu \times \Delta s & \text{if } 0 < \Delta s < \eta \\ 0 & \text{if } \Delta s \leq 0 \end{cases} \tag{3}$$

where $\mu$ and $\eta$ are predefined hyperparameters.
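The piecewise rule in Equation 3 translates directly into code. In the sketch below, the defaults `eta=10.0` and `mu=0.1` are hypothetical values chosen so that the reward is continuous at $\Delta s = \eta$; the paper's actual settings are in its appendix.

```python
def improvement_reward(
    s_init: float,
    s_fin: float,
    eta: float = 10.0,  # hypothetical saturation threshold
    mu: float = 0.1,    # hypothetical slope; mu * eta == 1 keeps the reward continuous
) -> float:
    """Piecewise reward on the score gain delta_s = s_fin - s_init."""
    delta_s = s_fin - s_init
    if delta_s >= eta:
        return 1.0           # large improvement: full reward
    if delta_s > 0:
        return mu * delta_s  # small improvement: proportional reward
    return 0.0               # no change or degradation: no reward
```

Note that a revision which lowers the score is not penalized by this term; it simply earns nothing.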

Total Reward ($R$). The total reward is a weighted sum of the aforementioned components:

$$R = w_{\text{form}} r_{\text{form}} + w_{\text{trans}} r_{\text{trans}} + w_{\text{refl}} r_{\text{refl}} + w_{\text{imp}} r_{\text{imp}} \tag{4}$$

where $w_{\text{form}}, w_{\text{trans}}, w_{\text{refl}}, w_{\text{imp}}$ denote the corresponding weights assigned to each reward component (see Appendix [B](https://arxiv.org/html/2604.19144#A2) for detailed hyperparameter settings).
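As a worked instance of Equation 4, the sketch below combines four component rewards. The weight values are hypothetical placeholders chosen only to make the arithmetic easy to follow, and `RewardWeights` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    form: float
    trans: float
    refl: float
    imp: float


def total_reward(
    r_form: float, r_trans: float, r_refl: float, r_imp: float, w: RewardWeights
) -> float:
    """Weighted sum of the four reward components (Equation 4)."""
    return w.form * r_form + w.trans * r_trans + w.refl * r_refl + w.imp * r_imp


# Hypothetical weights, not the paper's settings.
weights = RewardWeights(form=0.1, trans=0.4, refl=0.2, imp=0.3)

# 0.1 * 1 + 0.4 * 0.8 + 0.2 * 0.9 + 0.3 * 0.5 = 0.75
R = total_reward(r_form=1, r_trans=0.8, r_refl=0.9, r_imp=0.5, w=weights)
```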

#### 2.3.2 Stage 1: Reflection Capability Establishment

In Stage 1, our objective is for the model to master the complete reflective translation task. We use Qwen2.5-7B-Instruct (Yang et al., [2024](https://arxiv.org/html/2604.19144#bib.bib55)) as the base model and perform a cold-start Supervised Fine-Tuning (SFT) via Low-Rank Adaptation (LoRA) (Hu et al., [2021](https://arxiv.org/html/2604.19144#bib.bib11)), leveraging the data constructed in Section [2.2](https://arxiv.org/html/2604.19144#S2.SS2), enabling the model to generate the structured output format.

Subsequently, we apply the GRPO (Shao et al., [2024](https://arxiv.org/html/2604.19144#bib.bib14)) algorithm. In this stage, we expect the model to produce a reasonable initial translation and a highly refined final translation. Therefore, the translation quality reward for Stage 1 is defined as the average of the initial and final translation scores:

$$r_{\text{trans}}^{\text{stage1}} = \frac{s_{\text{init}} + s_{\text{fin}}}{200} \tag{5}$$
Given a source sentence $x$, GRPO samples $n$ generations $\{g_1, \ldots, g_n\}$ from the policy $\pi$. We compute the Stage 1 total reward $r_i$ for each $g_i$ using Equation [4](https://arxiv.org/html/2604.19144#S2.E4). GRPO optimizes the policy $\pi'$ by maximizing:

$$\frac{1}{n} \sum_{i=1}^{n} \Bigg[ \min\Big( \rho_i A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon) A_i \Big) - \beta \mathbb{D}_{\text{KL}}\bigl(\pi' \,\big\|\, \pi_{\text{ref}}\bigr) \Bigg] \tag{6}$$

where $\rho_i = \frac{\pi'(g_i|x)}{\pi(g_i|x)}$, $\pi_{\text{ref}}$ is the cold-start SFT model, and $\epsilon, \beta$ are hyperparameters. The advantage $A_i$ is computed by normalizing the rewards:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, r_2, \cdots, r_n\})}{\operatorname{std}(\{r_1, r_2, \cdots, r_n\})}. \tag{7}$$
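To make Equations 6 and 7 concrete, the sketch below computes group-normalized advantages and the clipped surrogate term for one sampled group. It is a simplified illustration: the KL penalty is omitted, the population standard deviation and the small `eps` guard against a zero denominator are assumptions, and `clip_eps=0.2` is a hypothetical value.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(
    ratios: List[float], advantages: List[float], clip_eps: float = 0.2
) -> float:
    """(1/n) * sum_i min(rho_i * A_i, clip(rho_i, 1 - eps, 1 + eps) * A_i)."""
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(rho * adv, clipped * adv))
    return sum(terms) / len(terms)


# Three sampled generations with total rewards from Equation 4.
advantages = group_advantages([0.2, 0.5, 0.8])

# Clipping caps the gain for a positive advantage (1.5 -> 1.2) but keeps
# the full penalty for a negative one, giving (1.2 + 0.0 - 1.5) / 3.
objective = clipped_surrogate([1.5, 1.0, 1.5], [1.0, 0.0, -1.0])
```

Because advantages are centered within each group, a generation is rewarded only for beating its siblings sampled from the same source sentence, not for its absolute score.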

#### 2.3.3 Stage 2: Capability Internalization and Transfer

In Stage 2, we shift the training focus to the quality of the initial translation\. To force the model to internalize the reflection knowledge acquired in Stage 1 into its first\-round generation, we introduce two critical modifications to the reward mechanism\.

First, the translation quality reward is solely determined by the initial translation score, disregarding the final translation:

$$r_{\text{trans}}^{\text{stage2}} = \frac{s_{\text{init}}}{100} \tag{8}$$

Second, we adjust the weight configuration of the total reward by significantly increasing the translation weight $w_{\text{trans}}^{\text{new}}$ and correspondingly decreasing the improvement weight $w_{\text{imp}}^{\text{new}}$. However, the model is still required to output the full structured tags during training to maintain RL stability and prevent catastrophic forgetting of the reasoning format. This reward restructuring forces the model to produce a near-perfect translation on its very first attempt ($y_{\text{init}}$), thereby reducing its reliance on subsequent reflection steps.
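The difference between the two stage-specific translation rewards (Equations 5 and 8) can be seen on a single example. The function below is an illustrative sketch; the scores used are made-up inputs, not results from the paper.

```python
def translation_reward(s_init: float, s_fin: float, stage: int) -> float:
    """Stage-dependent translation quality reward on 0-100 judge scores."""
    if stage == 1:
        return (s_init + s_fin) / 200.0  # Equation 5: average of both passes
    if stage == 2:
        return s_init / 100.0            # Equation 8: first pass only
    raise ValueError(f"unknown stage: {stage}")


# A weak first pass rescued by a strong revision (made-up scores).
stage1 = translation_reward(s_init=60, s_fin=95, stage=1)
stage2 = translation_reward(s_init=60, s_fin=95, stage=2)
```

Under the Stage 1 reward this output scores 0.775, since the revision compensates for the weak first pass; under Stage 2 it scores only 0.6, so the only way to raise the reward is to improve the initial translation itself.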

Inference Phase. Owing to this internalization mechanism, during inference, we employ an early stopping strategy that terminates generation upon detecting the </answer\> token. The model directly outputs high-quality first-pass translations, completely bypassing <reflection\> and <final\_answer\>. This effectively eliminates the computational overhead and latency associated with explicit long reasoning chains.
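The early stopping behavior can be simulated in plain Python over a stream of decoded text pieces. This is only a sketch: in practice the stop condition would normally be passed to the decoding library as a stop sequence, and the names `first_pass_translation` and `stream` are hypothetical.

```python
from typing import Iterable, Iterator, List

OPEN_TAG = "<answer>"
STOP_TAG = "</answer>"


def first_pass_translation(pieces: Iterable[str]) -> str:
    """Consume decoded text pieces until STOP_TAG appears, then return y_init."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        end = buffer.find(STOP_TAG)
        if end != -1:  # stop tag seen: discard it and stop reading
            buffer = buffer[:end]
            break
    start = buffer.find(OPEN_TAG)
    if start != -1:
        buffer = buffer[start + len(OPEN_TAG):]
    return buffer.strip()


# Simulated decoder output; the stop tag is split across two pieces on purpose.
decoded = ["<answer>", "你好，", "世界", "</ans", "wer>", "<reflection>", "..."]
consumed: List[str] = []


def stream() -> Iterator[str]:
    for piece in decoded:
        consumed.append(piece)  # record how much of the output was generated
        yield piece


translation = first_pass_translation(stream())
```

Searching the accumulated buffer rather than each piece handles a stop tag that is split across token boundaries, and the `consumed` list shows that nothing from the reflection block is ever requested.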

## 3 Experiments

| Model | Our GRF | Our MX24 | Our CK | Our Token | WMT23 GRF | WMT23 MX24 | WMT23 CK | WMT23 Token | WMT24 GRF | WMT24 MX24 | WMT24 CK | WMT24 Token | FLORES-200 GRF | FLORES-200 MX24 | FLORES-200 CK | FLORES-200 Token |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-7B-Instruct | 73.52 | 2.7 | 76.43 | 33.08 | 75.54 | 2.65 | 78.83 | 28.33 | 69.51 | 3.23 | 76.51 | 39.37 | 77.82 | 2.37 | 82.09 | 25.69 |
| Qwen2.5-14B | 74.21 | 2.65 | 77.02 | 29.06 | 83.69 | 2.17 | 80.67 | 31.14 | 77.63 | 2.77 | 78.81 | 43.38 | 85.83 | 1.75 | 84.18 | 29.59 |
| Qwen2.5-14B (w/ refl) | 79.08 | 2.58 | 77.21 | 188.44 | 88.48 | 2.07 | 80.85 | 193.38 | 85.33 | 2.63 | 79.29 | 208.86 | 88.19 | 1.67 | 84.47 | 183.86 |
| Llama3.1-8B | 62.19 | 3.2 | 74.98 | 39.18 | 75.35 | 2.58 | 79.29 | 31.71 | 65.57 | 3.46 | 75.95 | 52.34 | 75.82 | 2.2 | 82.49 | 46.87 |
| Qwen3-8B | 79.54 | 2.53 | 77.9 | 628.63 | 88.2 | 2 | 81.18 | 588.59 | 82.56 | 2.59 | 79.58 | 654.06 | 89.49 | 1.61 | 84.64 | 612.31 |
| Qwen3-8B (w/ refl) | 81.78 | 2.42 | 78.03 | 863.12 | 89 | 2 | 81.24 | 738.92 | 85.92 | 2.51 | 79.66 | 786.63 | 90.78 | 1.67 | 84.72 | 726.28 |
| Marco-o1-7B | 68.43 | 2.9 | 75.8 | 489.24 | 79.58 | 2.38 | 79.42 | 548.34 | 74.67 | 2.97 | 77.35 | 584.3 | 82.72 | 1.86 | 83.5 | 535.6 |
| Marco-o1-7B (w/ refl) | 75.09 | 2.84 | 76.07 | 731.71 | 85.34 | 2.26 | 80.03 | 725.51 | 81.88 | 2.88 | 78.18 | 763.97 | 86.84 | 1.86 | 83.55 | 720.18 |
| DeepSeek-R1 | 78 | 3.03 | 76.5 | 541.88 | 88.37 | 2.06 | 79.39 | 593.12 | 84.03 | 2.64 | 77.93 | 605.42 | 90.23 | 1.8 | 83.17 | 783.2 |
| DeepSeek-R1 (w/ refl) | 81.15 | 2.5 | 76.67 | 659.83 | 89.79 | 1.93 | 81.04 | 1029.68 | 85.61 | 2.56 | 79.28 | 1069.07 | 90.5 | 1.56 | 84.8 | 982.33 |
| QwQ-32B | 79.5 | 2.22 | 77.77 | 523.72 | 85.51 | 2.79 | 78.59 | 632.56 | 78.9 | 3.03 | 74.82 | 684.29 | 89.03 | 2.19 | 82.84 | 547.19 |
| QwQ-32B (w/ refl) | 81.65 | 2.16 | 77.97 | 1063.01 | 87.17 | 2.6 | 78.33 | 1136.59 | 84.47 | 2.98 | 77.61 | 1102 | 89.86 | 2.11 | 82.94 | 1007.71 |
| DeepTrans-7B | 75 | 3.29 | 72.6 | 2386.66 | 78.23 | 2.61 | 77.12 | 2018.96 | 76.22 | 2.99 | 77.39 | 2581.52 | 82.36 | 2.58 | 81.9 | 2707.06 |
| DRT-7B | 75.69 | 2.93 | 75.48 | 478.99 | 79.7 | 2.54 | 78.87 | 479.64 | 76.39 | 3 | 77.4 | 520.03 | 85.15 | 2.22 | 82.85 | 463.02 |
| ExTrans-7B | 78.95 | 2.33 | 77.63 | 979.08 | 81.83 | 2.17 | 78.96 | 1547.55 | 81.7 | 2.72 | 77.51 | 1957.79 | 86.64 | 1.82 | 82.89 | 1232.43 |
| Qwen-7B-Adapt-LoRA | 76.22 | 2.63 | 77.1 | 151.3 | 82.15 | 2.13 | 80.55 | 150.15 | 76.59 | 2.57 | 78.74 | 183.75 | 85.65 | 1.6 | 83.22 | 139.48 |
| Qwen-7B-Adapt-RL† | 79.81 | 2.27 | 77.62 | 56.29 | 83.7 | 1.73 | 80.77 | 192.73 | 79.4 | 2.15 | 79.68 | 239.31 | 86.53 | 1.37 | 84.56 | 172.99 |
| Qwen-7B-Full-RL‡ | 78.59 | 2.32 | 77.43 | 127.47 | 82.89 | 1.74 | 80.71 | 201.31 | 79.92 | 2.16 | 79.67 | 245.99 | 86.29 | 1.35 | 84.71 | 187.63 |
| ReflectMT-SFT (Cold Start) | 76.35 | 2.65 | 76.7 | 23.25 | 86.01 | 2.21 | 80.35 | 25.94 | 80.82 | 2.73 | 79.12 | 37.57 | 87.53 | 1.7 | 84.24 | 24.09 |
| ReflectMT-Stage1 (Our) | 79.73 | 2.37 | 77.45 | 29.41 | 87.89 | 1.82 | 80.78 | 28.91 | 83.39 | 2.49 | 79.41 | 41.39 | 89.32 | 1.45 | 84.39 | 24.88 |
| ReflectMT-Stage2 (Our) | 82.55 | 2 | 78.05 | 30.73 | 89.48 | 1.7 | 81.18 | 30.67 | 86.16 | 2.1 | 79.86 | 47.57 | 91.42 | 1.33 | 84.72 | 29.1 |

Table 1: Experimental results in English-to-Chinese translation. Bold and underlined values denote the best and second-best scores, respectively. † indicates Qwen with adaptive thought, while ‡ indicates full thought.

![Refer to caption](https://arxiv.org/html/2604.19144v1/x3.png) (a) Number of Modifications
![Refer to caption](https://arxiv.org/html/2604.19144v1/x4.png) (b) Accuracy
![Refer to caption](https://arxiv.org/html/2604.19144v1/x5.png) (c) Divergence
![Refer to caption](https://arxiv.org/html/2604.19144v1/x6.png) (d) Modifications by Difficulty

Figure 3: Training Dynamics. As the training steps increase, (a) indicates the number of refinements; (b) represents the GRF score; (c) illustrates the score difference between initial and final translations; (d) shows the number of refinements for tasks of varying difficulty.

| Model | Our GRF | Our MX24 | Our CK | Our Token | WMT23 GRF | WMT23 MX24 | WMT23 CK | WMT23 Token | WMT24 GRF | WMT24 MX24 | WMT24 CK | WMT24 Token | FLORES-200 GRF | FLORES-200 MX24 | FLORES-200 CK | FLORES-200 Token |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ReflectMT-SFT-initial | 76.35 | 2.65 | 76.7 | 23.25 | 86.01 | 2.21 | 80.35 | 25.94 | 80.82 | 2.73 | 79.12 | 37.57 | 87.53 | 1.7 | 84.24 | 24.09 |
| ReflectMT-SFT-final | 77.26 | 2.63 | 77.32 | 297.51 | 86.05 | 2.16 | 80.87 | 273.78 | 83.15 | 2.69 | 79.42 | 323.75 | 87.83 | 1.69 | 84.3 | 248.34 |
| ReflectMT-Stage1-initial | 79.73 | 2.37 | 77.45 | 29.41 | 87.89 | 1.82 | 80.78 | 28.91 | 83.39 | 2.49 | 79.41 | 41.39 | 89.32 | 1.45 | 84.39 | 24.88 |
| ReflectMT-Stage1-final | 80.51 | 2.23 | 77.81 | 261.25 | 88.46 | 1.87 | 80.99 | 239.15 | 84.55 | 2.24 | 79.51 | 298.27 | 90.06 | 1.39 | 84.5 | 322.28 |
| ReflectMT-Stage2-initial | 82.55 | 2 | 78.05 | 30.73 | 89.48 | 1.7 | 81.18 | 30.67 | 86.16 | 2.1 | 79.86 | 47.57 | 91.42 | 1.33 | 84.72 | 29.1 |
| ReflectMT-Stage2-final | 82.55 | 2 | 78.05 | 322.29 | 89.49 | 1.7 | 81.18 | 288.35 | 86.19 | 2.1 | 79.87 | 350.77 | 91.47 | 1.33 | 84.72 | 244.5 |

Table 2: Comparative performance of initial translations and final refinements across different training stages. The shrinking performance gap demonstrates the successful internalization of reflection capabilities.

### 3.1 Experimental Setups

Data. We constructed a reflective translation dataset specifically designed for English-to-Chinese machine translation, with the data volume shown in Table [4](https://arxiv.org/html/2604.19144#A4.T4). Each sample comprises the source sentence, the initial translation, the reflective analysis, and the final translation. In addition, we used official test sets from WMT23 ([https://www2.statmt.org/wmt23/translation-task.html](https://www2.statmt.org/wmt23/translation-task.html)), WMT24 ([https://www2.statmt.org/wmt24/translation-task.html](https://www2.statmt.org/wmt24/translation-task.html)), and FLORES-200 (Team et al., [2022](https://arxiv.org/html/2604.19144#bib.bib13)) to evaluate the model’s generalization ability. Comprehensive dataset statistics and qualitative examples are provided in Appendix [A](https://arxiv.org/html/2604.19144#A1).

Metrics. Following Wang et al. ([2024a](https://arxiv.org/html/2604.19144#bib.bib30)), we primarily employed the evaluation metric GRF (GPT Reference-Free) to assess the translation results of the model. GRF evaluates translations without requiring human references. The evaluation prompt is detailed in Appendix [B](https://arxiv.org/html/2604.19144#A2). Additionally, we reported two widely used semantic-level evaluation metrics based on pre-trained models: COMETKIWI-XL (CK) (Rei et al., [2023](https://arxiv.org/html/2604.19144#bib.bib8)) and MetricX-24 (MX24) (Juraska et al., [2024](https://arxiv.org/html/2604.19144#bib.bib12)). For efficiency, we report the average number of output tokens per sentence, calculated uniformly across all models.

Backbones. We use Qwen2.5-7B as our base model. For detailed training information, please refer to Appendix [C](https://arxiv.org/html/2604.19144#A3).

### 3.2 Baselines

To comprehensively evaluate the performance of our model, we selected three categories of baseline models, covering general large language models, reasoning\-based large models, and specialized machine translation models\.

General LLMs. We use Llama3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2604.19144#bib.bib57)), Qwen2.5-7B, Qwen2.5-14B (Yang et al., [2024](https://arxiv.org/html/2604.19144#bib.bib55)), and GPT-4o (OpenAI, [2024](https://arxiv.org/html/2604.19144#bib.bib2)) (see results in Appendix [E](https://arxiv.org/html/2604.19144#A5)) as baselines.

LRMs. We use Marco-o1-7B (Zhao et al., [2024](https://arxiv.org/html/2604.19144#bib.bib31)), QwQ-32B (Qwen, [2024](https://arxiv.org/html/2604.19144#bib.bib60)), DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2604.19144#bib.bib1)), and Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2604.19144#bib.bib20)) as baselines.

Specialized MT models. We train an adaptive pre-thinking baseline based on Qwen2.5-7B-Instruct (Appendix [D](https://arxiv.org/html/2604.19144#A4)). We also include DRT-7B (Wang et al., [2024a](https://arxiv.org/html/2604.19144#bib.bib30)), DeepTrans-7B (Wang et al., [2025a](https://arxiv.org/html/2604.19144#bib.bib49)), and ExTrans-7B (Wang et al., [2025b](https://arxiv.org/html/2604.19144#bib.bib16)), which are fine-tuned on MetaphorTrans (Wang et al., [2024a](https://arxiv.org/html/2604.19144#bib.bib30)) for literary translation and follow a “think-first-then-translate” paradigm.

To ensure a rigorous and equitable comparison, we additionally evaluated the baseline models under a “translate-reflect-refine” setting, denoted as w/ refl. Specifically, during inference, we apply an identical structured reflective prompt to the baselines. Notably, both our method and the baselines demonstrate perfect compliance with this format, consistently outputting the initial translation, reflective analysis, and final translation without exception.

### 3.3 Main Result

Our ReflectMT models employ a direct translation mode \(without reflection mechanisms\) during inference\. Experimental results demonstrate that ReflectMT achieves substantial improvements across all evaluation metrics\. To comprehensively validate its effectiveness, we conduct comparative analyses from four perspectives:

Comparison with general LLMs. As shown in Table [1](https://arxiv.org/html/2604.19144#S3.T1), our optimal model ReflectMT-Stage2 achieves a GRF score of 82.55 on our dataset, representing a 12.3% improvement over the baseline model Qwen2.5-7B-Instruct (73.52). For the MetricX-24 and CometKiwi metrics, our method achieves relative improvements of 26.0% and 2.1%, respectively. Notably, our model consumes an average of 30.73 tokens per sample during inference, nearly identical to the non-reflective general LLM (33.08 tokens/sample), introducing negligible additional inference overhead.

Comparison with strong LRM baselines. Our model demonstrates superior performance over strong LRMs, outperforming QwQ-32B (79.50) and DeepSeek-R1 (78.00) by 3.05 and 4.55 GRF points, respectively. To ensure a rigorous comparison, we also evaluated these baselines in a “translate-reflect-refine” mode. While this explicit reflection universally improves their translation quality (e.g., DeepSeek-R1 w/ refl improves to 81.15), it necessitates generating lengthy chains of thought, resulting in substantially higher token consumption (e.g., DeepSeek-R1 consumes 541.88 tokens/sample, 17× that of our model). In contrast, our model, built upon the compact Qwen2.5-7B backbone, internalizes this reflection capability to generate high-quality translations directly, offering significant advantages in both parameter scale and computational efficiency.
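The efficiency figures can be checked from the two token counts quoted above (30.73 tokens/sample for ReflectMT-Stage2 and 541.88 for DeepSeek-R1 on the authors' dataset). The short calculation below is only an arithmetic check with illustrative variable names; it suggests that the 94.33% reduction reported in the abstract corresponds to this pair of numbers.

```python
ours_tokens = 30.73          # ReflectMT-Stage2, average output tokens per sample
deepseek_r1_tokens = 541.88  # DeepSeek-R1, average output tokens per sample

# Relative reduction in token consumption, in percent.
reduction_pct = (1 - ours_tokens / deepseek_r1_tokens) * 100

# How many times more tokens DeepSeek-R1 emits per sample.
ratio = deepseek_r1_tokens / ours_tokens

print(f"token reduction: {reduction_pct:.2f}%")
print(f"token ratio: {ratio:.1f}x")
```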

Comparison with MT\-specialized models\.Compared to existing reasoning models specialized for machine translation, our method achieves optimal performance across all datasets and evaluation metrics\. Relative to ExTrans\-7B \(78\.95\), our method improves by 3\.6 GRF points\.

Comparison with pre-thinking models. To verify the advantages of post-thinking reflection over pre-thinking, we trained a pre-thinking baseline under identical training configurations. Compared to the adaptive pre-thinking model (79.81), our method improves by 2.74 GRF points while reducing token consumption by 45%. This strongly indicates that the post-thinking reflection paradigm is superior to the pre-thinking approach in both translation quality and token efficiency.

### 3.4 Reflection Internalization Analysis

Training Dynamics of Reflection Internalization. We evaluate the model every 50 steps during Stage 2 on a 2000-sample test set, temporarily disabling early-stopping to observe both initial and final translations. As shown in Figure [3](https://arxiv.org/html/2604.19144#S3.F3)(b), GRF scores exhibit a fluctuating upward trend, reflecting the model’s adjustment of internal representations to balance preliminary translation quality and reflective improvement ability during the gradual internalization of its reflective capability.

Furthermore, we monitor the frequency of explicit modifications. Crucially, the model learns to self-assess, triggering refinement only when it judges the initial translation as sub-optimal. As shown in Figure [3](https://arxiv.org/html/2604.19144#S3.F3)(a), modifications drop monotonically from 578 (28.9%) at Step 0 to a mere 14 (0.7%) by Step 330. This trend indicates that as the reflective capability is internalized, the model generates higher-quality translations in the first pass. Figure [3](https://arxiv.org/html/2604.19144#S3.F3)(c) shows that the score difference between first-pass and final translations exhibits a fluctuating downward trend. As the reflective capability is gradually internalized, the model ultimately generates optimal translations in the first pass without subsequent modifications. Table [2](https://arxiv.org/html/2604.19144#S3.T2) displays first-pass and final translation results at different training stages. This demonstrates the effectiveness of our reflective internalization training strategy: the model successfully incorporates the knowledge learned during reflection into its initial translation capability, achieving quality close to the final translation during the first generation.

Performance Across Difficulty Levels\.To evaluate model performance on translation tasks of varying difficulty, we divided samples into three levels based on GRF scores: easy \(¿90\), medium \(70\-90\), and difficult \(¡70\), with proportions approximately 11:6:3\. We recorded the number of samples requiring reflective modifications during training at each difficulty level\.

As shown in Figure [3](https://arxiv.org/html/2604.19144#S3.F3)(d), at Step 0, the model modified 65 easy, 218 medium, and 295 difficult samples. Despite the reflective capability acquired in first-phase training, there was still significant room for improvement in the initial translation stage, especially for difficult samples.

After second-phase reflective internalization training, model performance at Step 330 improved markedly. For easy samples, the model required no reflective modifications, indicating that first-round translations meet high-quality standards. For difficult samples, modifications decreased from 295 to 12, a 96% reduction. This demonstrates the model’s sensitivity to translation difficulty: for easy and medium samples, the model essentially fully internalized its reflective capability, generating high-quality translations during initial translation; for genuinely difficult samples, the model retained its reflective improvement capability to ensure translation quality.
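The difficulty buckets and the reported counts can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, and the treatment of the boundary scores 70 and 90 as "medium" is an assumption read off the ranges stated above.

```python
def difficulty_level(grf_score: float) -> str:
    """Bucket a sample by GRF score: easy (>90), medium (70-90), difficult (<70).

    Assumption: the boundary values 70 and 90 fall into "medium".
    """
    if grf_score > 90:
        return "easy"
    if grf_score >= 70:
        return "medium"
    return "difficult"


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop in the number of modified samples."""
    return 100.0 * (before - after) / before


# Modification counts at Step 0 as reported in the text (2000-sample test set).
step0 = {"easy": 65, "medium": 218, "difficult": 295}
total_step0 = sum(step0.values())          # total modified samples at Step 0
rate_step0 = 100.0 * total_step0 / 2000    # share of the test set, in percent
difficult_drop = reduction_pct(295, 12)    # drop for difficult samples by Step 330
```

The totals agree with the figures quoted above: 578 modified samples (28.9% of 2000) at Step 0, and a drop of roughly 96% for difficult samples.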

### 3.5 Ablation Study

To validate the effectiveness of the reflection mechanism, we compare three training configurations, where init and refl denote initial translation and reflection:

1\) ReflectMT (w/o init & refl). The model is trained to map the source sentence directly to the final translation, omitting both the initial translation and the reflection steps.

2\) ReflectMT (w/o refl). The model is trained to generate an initial translation followed directly by a refined translation, bypassing the explicit reflection analysis. This variant learns a “translation-refinement” process.

3\) ReflectMT (full). Our complete configuration, which trains the model on the full “translation-reflection-refinement” trajectory.

Table [3](https://arxiv.org/html/2604.19144#S3.T3) demonstrates that the reflection mechanism significantly enhances translation quality. During the LoRA-SFT stage, the w/o refl variant yields a final GRF score (76.49) lower than that of its initial translation (76.85), indicating that blind refinement without explicit guidance can degrade performance. Conversely, the full model improves from 76.35 to 77.26, suggesting that reflection provides accurate modification directions. Having established the necessity of reflection during SFT, we evaluate the ultimate performance ceiling after RL training. Comparing direct translation (w/o init & refl, 78.62) to the full ReflectMT (80.51), the 1.89-point GRF gap confirms the substantial benefit of the complete “translate-reflect-refine” paradigm.

| Model Configuration | Initial | Final |
| --- | --- | --- |
| Qwen-7B-Instruct (Baseline) | 73.52 | – |
| ReflectMT-LoRA (w/o init & refl) | – | 75.86 |
| ReflectMT-LoRA (w/o refl) | 76.85 | 76.49 |
| ReflectMT-LoRA (full) | 76.35 | 77.26 |
| ReflectMT-RL (w/o init & refl) | – | 78.62 |
| ReflectMT-RL (full) | 79.73 | 80.51 |

Table 3: Ablation study on the reflection mechanism evaluated by GRF scores. “Initial” and “Final” refer to the translations generated before and after explicit reflection. The “RL” stage is initialized from the checkpoint of the LoRA-SFT model.

## 4 Related Work

Application of Deep Reasoning Models in Machine Translation. With the emergence of Large Reasoning Models (LRMs) such as OpenAI o1 OpenAI ([2024](https://arxiv.org/html/2604.19144#bib.bib2)) and DeepSeek-R1 Guo et al. ([2025](https://arxiv.org/html/2604.19144#bib.bib1)), deep reasoning capabilities have spurred extensive research on long Chain-of-Thought (CoT) reasoning Gao et al. ([2026](https://arxiv.org/html/2604.19144#bib.bib75)). Recently, researchers have begun exploring their potential in Machine Translation (MT). Zhao et al. ([2024](https://arxiv.org/html/2604.19144#bib.bib31)) and Liu et al. ([2025b](https://arxiv.org/html/2604.19144#bib.bib37)) demonstrated the prospects of long CoT reasoning in translation through heuristic examples. Wang et al. ([2024a](https://arxiv.org/html/2604.19144#bib.bib30)) constructed the MetaphorTrans dataset and trained the DRT model to handle rhetorical devices in literary translation. DeepTrans-7B Wang et al. ([2025a](https://arxiv.org/html/2604.19144#bib.bib49)) leveraged DeepSeek-V3 for reference-free evaluation during Reinforcement Learning (RL) training, exploiting the capability of large language models as judges. Building on this, Wang et al. ([2025b](https://arxiv.org/html/2604.19144#bib.bib16)) proposed the ExTrans model with exemplar-enhanced RL reward modeling, integrating the dual capabilities of LLMs as both exemplar generators and judges, achieving breakthroughs in deep reasoning MT and extending to low-resource multilingual translation.

Reflection and Self-Improvement Behind Long CoT. Reflection and self-improvement capabilities underlying long CoT have been extensively validated in LLM research Bao et al. ([2026](https://arxiv.org/html/2604.19144#bib.bib76)). Shinn et al. ([2023](https://arxiv.org/html/2604.19144#bib.bib3)) proposed the Reflexion framework, systematically embedding self-reflection into LLM reasoning processes to enable iterative evaluation and improvement. Ji et al. ([2023](https://arxiv.org/html/2604.19144#bib.bib4)) explored self-reflection techniques to mitigate hallucination problems through factual checking and logical consistency verification. Weng et al. ([2023](https://arxiv.org/html/2604.19144#bib.bib5)) proposed a self-verification method introducing multiple verification points to prevent error propagation in CoT reasoning. Recent work has further optimized reflection efficiency and adaptability. Wang et al. ([2024b](https://arxiv.org/html/2604.19144#bib.bib6)) proposed the TasTe framework, improving translation through instruction fine-tuning for two-stage reflective translation. Liu et al. ([2025a](https://arxiv.org/html/2604.19144#bib.bib7)) proposed the IoRT framework, addressing redundancy and drift in static reflection through dynamic meta-instruction-guided processes. Costa et al. ([2026](https://arxiv.org/html/2604.19144#bib.bib15)) proposed PR-CoT, enabling multi-perspective structured reflection to improve model adaptability in complex tasks.

## 5 Conclusion

This paper introduces a novel “translate-reflect-refine” paradigm for machine translation. We design a multi-agent collaborative system to construct reflective translation datasets and propose a two-stage reinforcement learning strategy that internalizes reflection capabilities into the model’s direct responses. Extensive experiments demonstrate that our approach significantly outperforms strong baselines, including both LRMs and specialized translation models. Critically, our model generates high-quality translations without explicit reflection steps during inference, avoiding the computational overhead of long CoT reasoning while maintaining superior translation quality. Ablation studies validate the effectiveness of our reflection mechanism in providing interpretable reasoning paths for systematic improvement.

## Limitations

While we have demonstrated the effectiveness of our reflection framework, several limitations are worth noting. First, our research primarily focuses on English-to-Chinese translation and has not been sufficiently validated on other language pairs. Language pairs differ significantly in their linguistic properties, so the effectiveness of our reflection mechanism beyond English-to-Chinese requires further empirical investigation. Second, while our model avoids the computational overhead of long CoT reasoning during inference, the training phase (including data construction with the multi-agent collaborative system and the two-stage reinforcement learning training) still requires substantial computational resources, which may limit the application of this method in resource-constrained environments.

## References

- Y. Bao, X. Wang, and X. Tan (2026) To deceive is to teach? forging perceptual robustness via adversarial reinforcement learning. arXiv preprint arXiv:2602.22227. Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- M. Costa, A. R. Soarez, D. Kim, and C. Ferreira (2026) Enhancing self-correction in large language models through multi-perspective reflection. External Links: 2601.07780, [Link](https://arxiv.org/abs/2601.07780). Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Zhang, H\. Ding, H\. Xin, H\. Gao, H\. Li, H\. Qu, J\. L\. Cai, J\. Liang, J\. Guo, J\. Ni, J\. Li, J\. Wang, J\. Chen, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, L\. Zhao, L\. Wang, L\. Zhang, M\. Li, M\. Wang, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, N\. Tian, P\. Huang, P\. Wang, P\. Zhang, Q\. Wang, Q\. Zhu, Q\. Chen, Q\. Du, R\. J\. Chen, R\. L\. Jin, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Xu, R\. Zhang, R\. Chen, S\. S\. Li, S\. Lu, S\. Zhou, S\. Chen, S\. Wu, S\. Ye, S\. Ye, S\. Ma, S\. Wang, S\. Zhou, S\. Yu, S\. Zhou, S\. Pan, T\. Wang, T\. Yun, T\. Pei, T\. Sun, W\. L\. Xiao, W\. Zeng, W\. Zhao, W\. An, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, X\. Q\. Li, X\. Jin, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Shen, X\. Chen, X\. Zhang, X\. Chen, X\. Nie, X\. Sun, X\. Wang, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Song, X\. Shan, X\. Zhou, X\. Yang, X\. Li, X\. Su, X\. Lin, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. X\. Zhu, Y\. Zhang, Y\. Xu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Li, Y\. Wang, Y\. Yu, Y\. Zheng, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Tang, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Wu, Y\. Ou, Y\. Zhu, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Zha, Y\. Xiong, Y\. Ma, Y\. Yan, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Huang, Z\. Zhang, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Xu, Z\. Wu, Z\. Zhang, Z\. Li, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Gao, and Z\. 
Pan (2025) DeepSeek-v3 technical report. External Links: 2412.19437, [Link](https://arxiv.org/abs/2412.19437). Cited by: [§2.3.1](https://arxiv.org/html/2604.19144#S2.SS3.SSS1.p3.1).
- Z. Gao, X. Wang, X. Tan, and Y. Xie (2026) TPRU: advancing temporal and procedural understanding in large multimodal models. arXiv preprint arXiv:2602.18884. Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p2.1).
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. Cited by: [§1](https://arxiv.org/html/2604.19144#S1.p1.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p3.1), [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen (2021) LoRA: low-rank adaptation of large language models. External Links: 2106.09685, [Link](https://arxiv.org/abs/2106.09685). Cited by: [§2.3.2](https://arxiv.org/html/2604.19144#S2.SS3.SSS2.p1.1).
- Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung (2023) Towards mitigating hallucination in large language models via self-reflection. External Links: 2310.06271, [Link](https://arxiv.org/abs/2310.06271). Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- J. Juraska, D. Deutsch, M. Finkelstein, and M. Freitag (2024) MetricX-24: the google submission to the wmt 2024 metrics shared task. External Links: 2410.03983, [Link](https://arxiv.org/abs/2410.03983). Cited by: [§3.1](https://arxiv.org/html/2604.19144#S3.SS1.p2.1).
- T. Kocmi and C. Federmann (2023) Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, M. Nurminen, J. Brenner, M. Koponen, S. Latomaa, M. Mikhailov, F. Schierl, T. Ranasinghe, E. Vanmassenhove, S. A. Vidal, N. Aranberri, M. Nunziatini, C. P. Escartín, M. Forcada, M. Popovic, C. Scarton, and H. Moniz (Eds.), Tampere, Finland, pp. 193–203. External Links: [Link](https://aclanthology.org/2023.eamt-1.19/). Cited by: [Appendix B](https://arxiv.org/html/2604.19144#A2.p1.1), [Appendix B](https://arxiv.org/html/2604.19144#A2.p3.1).
- Y. Liang, F. Meng, J. Wang, and J. Zhou (2025) SlangDIT: benchmarking llms in interpretative slang translation. External Links: 2505.14181, [Link](https://arxiv.org/abs/2505.14181). Cited by: [Appendix A](https://arxiv.org/html/2604.19144#A1.p1.1).
- L. Liu, C. Zhang, L. Wu, C. Zhao, Z. Hu, M. He, and J. Fan (2025a) Instruct-of-reflection: enhancing large language models iterative reflection capabilities via dynamic-meta instruction. External Links: 2503.00902, [Link](https://arxiv.org/abs/2503.00902). Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- S. Liu, C. Lyu, M. Wu, L. Wang, W. Luo, and K. Zhang (2025b) New trends for modern machine translation with large reasoning models. arXiv preprint arXiv:2503.10351. Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- Q. Ma, J. Wei, O. Bojar, and Y. Graham (2019) Results of the wmt19 metrics shared task: segment-level and strong mt systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 62–90. Cited by: [Appendix A](https://arxiv.org/html/2604.19144#A1.p1.1).
- OpenAI (2024) Learning to reason with large language models. Note: [https://openai.com/index/learning-to-reason-with-llms/](https://openai.com/index/learning-to-reason-with-llms/). Cited by: [§1](https://arxiv.org/html/2604.19144#S1.p1.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p2.1), [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- T. Qwen (2024) Qwq: reflect deeply on the boundaries of the unknown. Hugging Face. Cited by: [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p3.1).
- R. Rei, N. M. Guerreiro, J. Pombal, D. van Stigt, M. Treviso, L. Coheur, J. G. C. de Souza, and A. F. T. Martins (2023) Scaling up cometkiwi: unbabel-ist 2023 submission for the quality estimation shared task. External Links: 2309.11925, [Link](https://arxiv.org/abs/2309.11925). Cited by: [§3.1](https://arxiv.org/html/2604.19144#S3.SS1.p2.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. External Links: 2402.03300, [Link](https://arxiv.org/abs/2402.03300). Cited by: [§2.3.2](https://arxiv.org/html/2604.19144#S2.SS3.SSS2.p2.1).
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. External Links: 2303.11366, [Link](https://arxiv.org/abs/2303.11366). Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- N. Team, M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang (2022) No language left behind: scaling human-centered machine translation. External Links: 2207.04672, [Link](https://arxiv.org/abs/2207.04672). Cited by: [§3.1](https://arxiv.org/html/2604.19144#S3.SS1.p1.1).
- J. Wang, F. Meng, Y. Liang, and J. Zhou (2024a) Drt-o1: optimized deep reasoning translation via long chain-of-thought. arXiv e-prints, pp. arXiv–2412. Cited by: [Appendix A](https://arxiv.org/html/2604.19144#A1.p1.1), [Appendix B](https://arxiv.org/html/2604.19144#A2.p1.1), [§1](https://arxiv.org/html/2604.19144#S1.p2.1), [§3.1](https://arxiv.org/html/2604.19144#S3.SS1.p2.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p4.1), [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- J. Wang, F. Meng, and J. Zhou (2025a) Deep reasoning translation via reinforcement learning. arXiv preprint arXiv:2504.10187. Cited by: [Appendix B](https://arxiv.org/html/2604.19144#A2.p1.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p4.1), [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- J. Wang, F. Meng, and J. Zhou (2025b) Extrans: multilingual deep reasoning translation via exemplar-enhanced reinforcement learning. arXiv preprint arXiv:2505.12996. Cited by: [Appendix B](https://arxiv.org/html/2604.19144#A2.p1.1), [§1](https://arxiv.org/html/2604.19144#S1.p2.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p4.1), [§4](https://arxiv.org/html/2604.19144#S4.p1.1).
- Y. Wang, J. Zeng, X. Liu, F. Meng, J. Zhou, and M. Zhang (2024b) TasTe: teaching large language models to translate through self-reflection. External Links: 2406.08434, [Link](https://arxiv.org/abs/2406.08434). Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- Y. Weng, M. Zhu, F. Xia, B. Li, S. He, S. Liu, B. Sun, K. Liu, and J. Zhao (2023) Large language models are better reasoners with self-verification. External Links: 2212.09561, [Link](https://arxiv.org/abs/2212.09561). Cited by: [§4](https://arxiv.org/html/2604.19144#S4.p2.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p3.1).
- A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: [§2.3.2](https://arxiv.org/html/2604.19144#S2.SS3.SSS2.p1.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p2.1).
- Y. Zhao, H. Yin, B. Zeng, H. Wang, T. Shi, C. Lyu, L. Wang, W. Luo, and K. Zhang (2024) Marco-o1: towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405. Cited by: [§1](https://arxiv.org/html/2604.19144#S1.p2.1), [§3.2](https://arxiv.org/html/2604.19144#S3.SS2.p3.1), [§4](https://arxiv.org/html/2604.19144#S4.p1.1).

## Appendix A Our Data

Our dataset is constructed by randomly sampling from WMT19 Ma et al. ([2019](https://arxiv.org/html/2604.19144#bib.bib10)), SlangDIT Liang et al. ([2025](https://arxiv.org/html/2604.19144#bib.bib17)), and MetaphorTrans Wang et al. ([2024a](https://arxiv.org/html/2604.19144#bib.bib30)). We employ DeepSeek-V3 to generate the corresponding translations and reflections. Subsequently, we categorize the difficulty of these samples based on their final GRF scores and filter the data accordingly. The resulting dataset maintains an approximate 1:1 ratio of simple to difficult samples. The detailed data distribution is shown in Table [4](https://arxiv.org/html/2604.19144#A4.T4). By incorporating these difficult samples, we aim to fully elicit the model’s reasoning capabilities, thereby encouraging it to learn in-depth reflection and idiomatic translation.

During the Cold Start Supervised Fine-Tuning (SFT) stage, we train the model using the complete samples, comprising the initial translation, reflection analysis, and final translation, to help it internalize the entire reflective translation pipeline. Conversely, in the Reinforcement Learning (RL) stage, we provide only the source English sentences as input, without any reference translations or annotations. For data generation, we locally deploy DeepSeek-V3 (671B) with the temperature set to 0.1 and top-p to 0.95. Furthermore, the hyperparameters in Section [2.2](https://arxiv.org/html/2604.19144#S2.SS2)(3) are set as follows: $\theta = 90$, $K_{max} = 5$.

To ensure data quality, we apply rigorous filtering to exclude: (1) samples with invalid formats; (2) samples with consistently low scores ($r_K < 40$); and (3) samples exhibiting performance degradation after reflection ($r_K < r_0$).
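The three exclusion rules can be sketched as a single check per sample. This is an illustrative sketch only: the function name and the `REQUIRED_KEYS` set are hypothetical, the field names are borrowed from the example record shown in this appendix, and it assumes that $r_0$ is the score of the first entry in `history` and $r_K$ is `final_score`.

```python
from typing import Optional

# Assumed minimal schema, taken from the example record in this appendix.
REQUIRED_KEYS = {"src", "final_translation", "final_score", "history"}


def rejection_reason(sample: dict, min_score: float = 40.0) -> Optional[str]:
    """Return why a sample is excluded, or None if it passes all three rules."""
    history = sample.get("history")
    # Rule 1: invalid format (missing fields or an empty/unscored history).
    if not REQUIRED_KEYS.issubset(sample) or not history or "score" not in history[0]:
        return "invalid_format"
    r_0 = history[0]["score"]      # assumed: score of the first round
    r_k = sample["final_score"]    # assumed: score after the last round
    # Rule 2: consistently low score (r_K < 40).
    if r_k < min_score:
        return "low_score"
    # Rule 3: reflection made the translation worse (r_K < r_0).
    if r_k < r_0:
        return "degraded"
    return None


kept = rejection_reason(
    {"src": "s", "final_translation": "t", "final_score": 90,
     "history": [{"round": 1, "score": 85}]}
)
```

Returning a reason string rather than a boolean makes it easy to count how many samples each rule removes.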

The initial format of our reflective translation dataset is as follows:

{"src": "The big losers from this economic disaster are workers in the advanced countries that bought into the laissez-faire flexibility of American-style capitalism.", "ref": "这场经济灾难的大输家是那些认同美式资本主义放任自流政策的发达国家的工人。", "final_translation": "在这场经济灾难中遭受重大损失的是那些接受了美国式资本主义自由放任灵活性的发达国家的工人。", "total_rounds": 1, "final_score": 90, "best_round": 1, "history": [{"round": 1, "translation": "在这场经济灾难中遭受重大损失的是那些信奉美国式资本主义自由放任灵活性的发达国家的工人。", "reflection": "1. 翻译质量分析: - 准确性:翻译基本准确传达了原文的意思,特别是“economic disaster”(经济灾难)、“big losers”(遭受重大损失)和“American-style capitalism”(美国式资本主义)等关键术语的处理得当。 - 流畅性:译文整体较为流畅,符合中文表达习惯。句子结构清晰,逻辑连贯。 - 语境适配:译文很好地适应了原文的经济学背景,尤其是对“laissez-faire flexibility”(自由放任灵活性)的解释和处理,符合经济学领域的常用表达。 - 过度翻译:没有明显的过度翻译现象,所有信息点都得到了恰当的传达。2. 存在的问题: - “bought into”这一短语的翻译略显生硬,直译为“信奉”虽然可以理解,但可能不够精准。“bought into”在这里更强调对这些理念的接受或认同,而非单纯的“信奉”。3. 改进建议: - 可以将“bought into”翻译为“接受了”或“认同了”,以更准确地传达原文的含义。例如:“在这场经济灾难中遭受重大损失的是那些接受了美国式资本主义自由放任灵活性的发达国家的工人。”", "score": 90}]}

The data format for cold start SFT is as follows:

<answer>2018年非洲面临的三大人道主义挑战</answer> <reflection>1. 翻译质量分析:- 准确性:翻译准确传达了原文的核心信息,"Three Humanitarian Challenges"被准确地译为"三大人道主义挑战","for Africa in 2018"也被正确地处理为"2018年非洲面临的"。 - 流畅性:译文符合中文表达习惯,语句通顺自然。 - 语境适配:译文很好地适应了新闻报道或学术文章的语境,保持了正式和客观的语气。 - 过度翻译:未发现过度翻译现象,译文简洁明了。2. 存在的问题: - 无显著问题。译文在各方面都达到了较高的标准。3. 改进建议: - 虽然译文已经很完美,但可以考虑在特定语境下增加一些修饰词以增强感染力,例如:"2018年非洲亟待解决的三大紧迫人道主义挑战"。这种调整可以根据具体使用场景灵活选择。"</reflection> <need_revision>否</need_revision> <final_answer>2018年非洲面临的三大人道主义挑战</final_answer>
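A trajectory in this format can be split into its four fields with a small parser. The sketch below is an illustration under assumptions, not the authors' code: it assumes the tags are plain angle-bracket tags (`<answer>`, `<reflection>`, `<need_revision>`, `<final_answer>`), and the function name `parse_trajectory` is hypothetical.

```python
import re
from typing import Optional

# Tag names as they appear in the cold start SFT example above.
TAGS = ("answer", "reflection", "need_revision", "final_answer")


def parse_trajectory(text: str) -> Optional[dict]:
    """Extract the four tagged fields; return None if any tag is missing."""
    fields = {}
    for tag in TAGS:
        # Non-greedy match so each field stops at its own closing tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if match is None:
            return None
        fields[tag] = match.group(1).strip()
    return fields


example = (
    "<answer>2018年非洲面临的三大人道主义挑战</answer> "
    "<reflection>无显著问题。</reflection> "
    "<need_revision>否</need_revision> "
    "<final_answer>2018年非洲面临的三大人道主义挑战</final_answer>"
)
parsed = parse_trajectory(example)
# "否" is the no-revision value shown in the example above.
revised = parsed["need_revision"] != "否"
```

A parser of this kind could also serve as a format check, since it rejects any output that omits one of the tags.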

## Appendix B Evaluation Prompt

The evaluation prompt of GRF borrows from Kocmi and Federmann ([2023](https://arxiv.org/html/2604.19144#bib.bib54)), and is also employed in Wang et al. ([2024a](https://arxiv.org/html/2604.19144#bib.bib30), [2025a](https://arxiv.org/html/2604.19144#bib.bib49), [2025b](https://arxiv.org/html/2604.19144#bib.bib16)):

Score the following translation from [src lang] to [trg lang] on a continuous scale from 0 to 100, where score of zero means "no meaning preserved" and score of one hundred means "perfect preservation of meaning, with faithfulness, expressiveness, and elegance".

[src lang] source: {src}

[trg lang] translation: {hyp}

Score:

The prompt evaluates a translation from a general perspective, and achieves a high correlation with humans Kocmi and Federmann ([2023](https://arxiv.org/html/2604.19144#bib.bib54)).
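For illustration, the prompt can be filled in and the judge's reply parsed as follows. This is a sketch under assumptions: the wording of the template is taken from the prompt above, but the line layout, the default language names, and both function names (`build_grf_prompt`, `parse_score`) are hypothetical, and no particular judge API is implied.

```python
import re
from typing import Optional

# Template wording follows the prompt above; the line breaks are an assumption.
GRF_PROMPT = (
    "Score the following translation from {src_lang} to {trg_lang} on a "
    "continuous scale from 0 to 100, where score of zero means "
    '"no meaning preserved" and score of one hundred means '
    '"perfect preservation of meaning, with faithfulness, expressiveness, '
    'and elegance".\n'
    "{src_lang} source: {src}\n"
    "{trg_lang} translation: {hyp}\n"
    "Score:"
)


def build_grf_prompt(src: str, hyp: str,
                     src_lang: str = "English", trg_lang: str = "Chinese") -> str:
    """Fill the template with one source sentence and one hypothesis."""
    return GRF_PROMPT.format(src_lang=src_lang, trg_lang=trg_lang, src=src, hyp=hyp)


def parse_score(reply: str) -> Optional[float]:
    """Read the first number in the judge's reply; reject values outside 0-100."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    score = float(match.group())
    return score if 0.0 <= score <= 100.0 else None


prompt = build_grf_prompt("Hello", "你好")
```

Rejecting out-of-range or non-numeric replies keeps malformed judge outputs from silently skewing an average.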

## Appendix C Implementation Details

Cold Start SFT. We employed the LLaMA-Factory framework for Supervised Fine-Tuning (SFT). The model was trained for 3 epochs with a learning rate of $1\times 10^{-4}$ and a batch size of 1 with 8 gradient accumulation steps. The entire SFT process required approximately 3 GPU hours.

RL Training. We implemented the GRPO algorithm using the verl library ([https://github.com/volcengine/verl](https://github.com/volcengine/verl)). We locally deployed DeepSeek-V3 to serve as the reward model for optimizing the policy model. For the training configuration, we set the global batch size to 64, the learning rate to $2\times 10^{-7}$, the number of rollouts to 8, the rollout temperature to 0.7, and the KL coefficient to $5\times 10^{-3}$. The training was conducted for 2 epochs, consuming a total of approximately 770 GPU hours.
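In GRPO, each rollout's advantage is its reward standardized within the group of rollouts sampled for the same prompt (8 per prompt in the configuration above). The sketch below illustrates only that normalization step and is not the verl implementation: the function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions, and the reward values are made up for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of rollouts for a single prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; actual implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for 8 rollouts of one source sentence.
rollout_rewards = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73, 0.59, 0.77]
advantages = group_relative_advantages(rollout_rewards)
```

Because advantages are relative to the group mean, a rollout is reinforced only when it scores above its siblings for the same source sentence, so no separate value model is required.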

Hyperparameters. Based on empirical results, we set $\eta$ and $\mu$ to 5 and 0.2, respectively. In Stage 1, the reward weights were set as follows: $w_{\text{form}}=0.15$, $w_{\text{trans}}=0.45$, $w_{\text{refl}}=0.2$, and $w_{\text{imp}}=0.2$. In Stage 2, we adjusted these weights to $w_{\text{form}}=0.15$, $w_{\text{trans}}=0.75$, $w_{\text{refl}}=0.05$, and $w_{\text{imp}}=0.05$.
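To show how the shift in weights changes the training signal, the sketch below combines four reward components as a weighted sum. This is an assumption made for illustration: the exact reward formula is defined in the method section, not here, and the component scores, their [0, 1] scale, and the function name `combined_reward` are hypothetical. Only the weight values come from the text above.

```python
from typing import Dict

# Weight values as reported above; keys abbreviate form, translation,
# reflection, and improvement rewards.
STAGE1_WEIGHTS = {"form": 0.15, "trans": 0.45, "refl": 0.20, "imp": 0.20}
STAGE2_WEIGHTS = {"form": 0.15, "trans": 0.75, "refl": 0.05, "imp": 0.05}


def combined_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of reward components (assumed linear combination)."""
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical rollout: strong first-pass translation, little reflection gain.
rollout = {"form": 1.0, "trans": 0.9, "refl": 0.2, "imp": 0.1}
stage1_reward = combined_reward(rollout, STAGE1_WEIGHTS)
stage2_reward = combined_reward(rollout, STAGE2_WEIGHTS)
```

Under this assumed form, both weight sets sum to 1, and Stage 2 rewards the hypothetical rollout more than Stage 1 does, which matches the stated goal of shifting emphasis toward first-pass translation quality.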

Evaluation. To evaluate the baseline models, we obtained open-source models from Hugging Face and accessed closed-source models via their official APIs. We utilized the official implementations and model checkpoints to compute the COMETKIWI-XL and MetricX metrics. Similarly, the GRF scores were obtained via API calls.

## Appendix D Our Adaptive Thinking Baseline

Our pre-thinking baseline uses training data aligned with our method, employing DeepSeek-V3 to generate pre-thinking data for SFT cold start. The data format is as follows:

<think>首先,"kind of"是一个常见的俚语表达,通常用来表示某种程度的不确定性或轻微的程度。然后,"Riverdale"是一个专有名词,但在这个上下文中,它可能指的是一个地方名,也可能是电视剧《Riverdale》中的Riverdale镇。如果是指地点,那么“his last meal in Riverdale”可以理解为他在Riverdale的最后一次用餐;如果是指电视剧,则可能是在描述剧中人物即将离开或结束在Riverdale的生活。考虑到俚语的使用,这句句子更有可能出自一种非正式的语境,比如电视剧的讨论或者个人的情感表达。因此,我将假设此句出自电视剧的讨论背景。</think><answer>有点希望这是他在Riverdale的最后一顿饭</answer>

For samples where no reflection is required, the `<think></think>` tag remains empty. The ratio of reflective to non-reflective data is set at 2:3. Using LLaMA-Factory for fine-tuning and verl for GRPO training, we trained a model capable of adaptively deciding whether to engage in reflection before generating a response. This adaptive reflection method allows the model to avoid excessive contemplation on simple tasks, thereby preventing performance degradation. Furthermore, it reduces token consumption, enhancing overall efficiency.

| Split | EN→ZH | ZH→EN |
| --- | --- | --- |
| Train | 11,389 | 8,000 |
| Test | 2,000 | 2,000 |
| Val | 1,000 | 1,000 |

Table 4: Dataset statistics for translation tasks.

## Appendix E GPT-4o Score

| Model | GRF | MX24 | CK | Token |
| --- | --- | --- | --- | --- |
| GPT-4o | 77.62 | 2.63 | 77.97 | 28.57 |
| ReflectMT-Stage2 (Ours) | 82.55 | 2.00 | 78.05 | 30.73 |

Table 5: Performance metrics for GPT-4o and ReflectMT-Stage2.

Due to the costs associated with using the GPT-4o API, we limited our testing of GPT-4o to our custom dataset. As shown in Table [5](https://arxiv.org/html/2604.19144#A5.T5), ReflectMT achieves a higher GRF score than GPT-4o while maintaining similar token consumption.
