Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

arXiv cs.AI Papers

Summary

This paper proposes CTO, a method that improves code translation by combining syntax-guided and semantic-aware preference optimization through contrastive learning and direct preference optimization, achieving significant improvements over existing baselines in C++, Java, and Python translations.

arXiv:2605.13229v1 Announce Type: new Abstract: LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.

# Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization
Source: [https://arxiv.org/html/2605.13229](https://arxiv.org/html/2605.13229)

Huan Zhang¹, Wei Cheng¹, Chen Shen¹, Jingyue Yang¹ & Wei Hu¹˒²

¹ State Key Laboratory for Novel Software Technology, Nanjing University, China; ² National Institute of Healthcare Data Science, Nanjing University, China

{yhwu, zhanghuan, wchengcs, cshen, jyyang}.nju@gmail.com, whu@nju.edu.cn (Corresponding author)

###### Abstract

LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.

## 1 Introduction

Code translation, the process of migrating functionality from a source programming language to a target language, is a crucial task in modern software engineering. It promotes code reuse, facilitates legacy system modernization, and enables cross-language interoperability in an increasingly heterogeneous software ecosystem Nguyen et al. ([2014](https://arxiv.org/html/2605.13229#bib.bib17)); Zhu et al. ([2022a](https://arxiv.org/html/2605.13229#bib.bib8)); Yan et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib11)). While recent advances in pre-trained language models have shown immense potential Lu et al. ([2021](https://arxiv.org/html/2605.13229#bib.bib19)); Zhu et al. ([2022b](https://arxiv.org/html/2605.13229#bib.bib7)); Zheng et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib20)); Roziere et al. ([2020](https://arxiv.org/html/2605.13229#bib.bib1)), their practical application is frequently hindered by a key challenge: ensuring that the translated code is not only syntactically correct but also semantically equivalent to the source. Even state-of-the-art large language models (LLMs) often make errors due to inadequate understanding of the syntactic and semantic nuances across different programming languages Pan et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib2)); Yang et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib12)).

![Refer to caption](https://arxiv.org/html/2605.13229v1/x1.png)

Figure 1: A motivating example illustrating the entanglement of syntax and semantics in execution-based reward. Target C++ I is syntactically valid but semantically flawed; however, it exploits sparse test cases (where `A[0]` happens to be the minimum) to achieve a false positive pass (reward hacking). Conversely, Target C++ II maintains semantic equivalence (using `min_element`) but receives a zero reward due to a compilation error.

To address this, a natural evolution is to move beyond standard supervised finetuning and adopt preference-based learning. By leveraging reward signals derived from user judgments or task-specific criteria, reinforcement learning from human feedback (RLHF) Ouyang et al. ([2022](https://arxiv.org/html/2605.13229#bib.bib38)) has been widely employed to align model outputs with human preferences. However, applying this paradigm to code translation raises a fundamental question: How can we define and obtain reliable reward signals for both syntactic correctness and semantic consistency? For syntax, compiler feedback serves as an infallible oracle, providing a deterministic and binary signal of syntactic correctness Shojaee et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib13)). For semantics, however, the path is fraught with challenges.

As illustrated in Figure [1](https://arxiv.org/html/2605.13229#S1.F1), prevailing strategies for semantic rewards are built on flawed foundations. The most common strategy is to derive rewards from test cases, treating test outcomes as a proxy for functional correctness Le et al. ([2022](https://arxiv.org/html/2605.13229#bib.bib43)); Gee et al. ([2025](https://arxiv.org/html/2605.13229#bib.bib47)); Zhang et al. ([2025](https://arxiv.org/html/2605.13229#bib.bib46)). However, real-world test suites are often sparse and exhibit low coverage, which can lead to reward hacking Ma et al. ([2025](https://arxiv.org/html/2605.13229#bib.bib48)), where the model overfits to these limited test cases without preserving the intended semantics of the source code. In the context of code translation, relying on test cases also entangles syntax and semantics. A single syntactic divergence in the target language often invalidates the entire execution, yielding a zero pass rate regardless of semantic accuracy. Consequently, the model receives a binary reward that fails to decouple semantic correctness from syntactic validity, making it impossible to quantify the magnitude of semantic deviation. An alternative is to use the reference translation as a semantic anchor. However, this approach gives rise to two flawed strategies. The first strategy compares the generated code to the reference using text-based similarity metrics such as CodeBLEU Ren et al. ([2020](https://arxiv.org/html/2605.13229#bib.bib45)). This strategy is fundamentally superficial, as it rewards lexical similarity over true functional equivalence. The second, more sophisticated strategy introduces a monolingual semantic model to evaluate the similarity between candidate translations and the reference. However, it implicitly assumes that the reference translation is a perfect and complete representation of the source semantics. This assumption introduces a semantic bottleneck that restricts output diversity and may even propagate errors in the reference.

We argue that a robust semantic reward should be disentangled from such flawed proxies and instead be derived directly from the source code itself. We propose CTO, a novel approach that improves code translation via syntax-guided and semantic-aware preference optimization.

We reformulate code translation as a multi-objective preference optimization problem, explicitly modeling both syntax and semantics as distinct objectives. CTO leverages compiler feedback to construct syntax-aware preference pairs. For semantic alignment, it trains a cross-lingual semantic model via contrastive learning to assess functional equivalence directly between the source and translated candidates. It then injects the semantic reward difference of a preference pair directly into the learning process, biasing the preference strength to unify syntactic and semantic objectives within the direct preference optimization (DPO) Rafailov et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib6)) framework.

We conduct extensive experiments on pairwise code translation between C\+\+, Java, and Python\. Results demonstrate that CTO outperforms existing approaches and alternative preference optimization strategies, achieving a superior alignment with the intended functionality of the source code\.

To summarize, our main contributions are as follows:

- We propose CTO, a novel approach that improves code translation via syntax-guided and semantic-aware preference optimization. It introduces a reward-biasing mechanism within DPO, enabling unified optimization of both syntactic correctness and semantic consistency.
- We train a cross-lingual semantic model as a robust reward source. It directly evaluates functional equivalence between the source code and translation candidates, overcoming the fundamental limitations of reward signals derived from other proxies.
- Our experiments show that CTO improves translation accuracy by up to 3.66% on the TransCoder-Test dataset and 4.27% on HumanEval-X with the finetuned CodeT5 model. With a larger finetuned CodeLlama-7B model, it yields accuracy gains of 5.60% and 6.70%, respectively.

## 2 Related Work

#### Code Translation\.

Early works Roziere et al. ([2020](https://arxiv.org/html/2605.13229#bib.bib1), [2022](https://arxiv.org/html/2605.13229#bib.bib3)) on code translation primarily focus on self-supervised learning. Building upon this, several works Szafraniec et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib4)); Huang et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib16)); Liu et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib15)) explore incorporating code structure information to enhance code representations for unsupervised code translation. These methods demonstrate the feasibility of learning cross-language representations without parallel data. Despite their effectiveness, unsupervised approaches require collecting enormous code corpora and consume substantial computational resources.

With the availability of high-quality parallel datasets Zhu et al. ([2022a](https://arxiv.org/html/2605.13229#bib.bib8)); Yan et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib11)); Khan et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib31)); Yan et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib32)) and pre-trained code models Wang et al. ([2021](https://arxiv.org/html/2605.13229#bib.bib24)); Guo et al. ([2021](https://arxiv.org/html/2605.13229#bib.bib25)); Zheng et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib20)); Feng et al. ([2020](https://arxiv.org/html/2605.13229#bib.bib30)), supervised finetuning has become a paradigm for code translation. Building on this foundation, recent studies explore incorporating reinforcement learning (RL) techniques Shojaee et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib13)) and variational inference techniques Du et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib14)) to further enhance the performance. Besides, a few recent works Ibrahimzada et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib41)); Wang et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib40)) focus on repository-level code translation, aiming to facilitate complex codebase migration. Our work follows the supervised finetuning paradigm, further advancing code translation performance without relying on any auxiliary datasets.

#### Reinforcement Learning from Human Feedback\.

RLHF has significantly improved the performance of downstream tasks by aligning models with human preferences. While reward-based methods like proximal policy optimization Schulman et al. ([2017](https://arxiv.org/html/2605.13229#bib.bib10)) and group relative policy optimization Shao et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib23)) demonstrate success, they suffer from increased memory overhead and substantial computational burdens due to their requirement for a large number of samples during each policy update cycle.

Reward-free methods address these limitations from a different perspective. Direct preference optimization Rafailov et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib6)) eliminates the need for an explicit reward model by implicitly estimating reward signals from preference data. Identity preference optimization Azar et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib21)) introduces a general non-decreasing bounded function to mitigate overfitting, while simple preference optimization Meng et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib22)) further simplifies the process by removing the reference model altogether. Additionally, some studies Park et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib39)); Meng et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib22)) explore the regularization terms as a margin to improve reward modeling. Despite these advancements, directly applying these techniques to code translation remains suboptimal due to inadequate syntax representation and incomplete semantic modeling. To address this challenge, CTO enhances preference optimization by jointly incorporating syntax and semantics, leading to improved accuracy and robustness in code translation.

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2605.13229v1/x2.png)

Figure 2: Overview of our CTO.

### 3.1 Reward Model Training

The overall reward comprises both syntactic and semantic components. The syntactic reward is readily accessible, as it can be directly derived from the compiler feedback. This binary and rule-based signal indicates whether the translated code adheres to the target language’s grammar, following standard practice in prior work Shojaee et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib13)).
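As a minimal sketch of such a binary syntactic signal, the snippet below checks Python candidates only, using the standard-library parser as a stand-in for a compiler. The function name `python_syntax_reward` is hypothetical, and the paper does not specify how its compiler checks are invoked; C++ and Java targets would need an external compiler instead.

```python
import ast


def python_syntax_reward(code):
    """Binary syntactic reward for a Python candidate: 1 if it parses, else 0.

    Assumption of this sketch: parsing with the standard-library `ast`
    module stands in for the compiler feedback described in the text.
    """
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # SyntaxError: invalid grammar; ValueError: e.g. source with null bytes
        return 0
    return 1
```

The signal is deterministic for a fixed parser version, which is what makes it usable as a reliable preference label.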

In contrast, designing an effective semantic reward presents a more challenging problem, as code semantics encompass not only functional equivalence, but also logical fidelity to the source code’s intent. Prior studies have approached this from different perspectives. Execution-based rewards, as in CodeRL Le et al. ([2022](https://arxiv.org/html/2605.13229#bib.bib43)), use discrete, rule-based heuristics to evaluate semantic correctness, primarily relying on unit tests. However, test cases are often scarce or even unavailable in function- or program-level translation training data, resulting in unstable execution-based reward signals. Furthermore, such rewards are inherently binary and sparse, making it difficult to quantify the proximity of an incorrect translation to the correct logic. Reference-based rewards, as in PPOCoder Shojaee et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib13)), employ reward signals derived from static program representations, such as abstract syntax trees and dataflow graphs, which often depend on comparisons with ground-truth reference implementations. The reliance on reference translations restricts their applicability in real-world scenarios, where such references are often incomplete or entirely absent. Moreover, minor code variations, such as variable renaming, can lead to significant runtime or functional errors, which are often not adequately penalized by textual metrics.

#### Semantic Reward Model\.

To address these issues, we propose to assess semantic correctness by measuring semantic distance within a high-dimensional latent space. Formally, our objective is to learn a mapping function $f:\mathcal{X}\cup\mathcal{Y}\to\mathcal{Z}$ such that geometric proximity in $\mathcal{Z}$ reflects the functional equivalence between the source code $x$ and the target code $y$. To shape this latent space, we construct a source-anchored triplet $\mathcal{T}=(x,y^{+},\mathcal{Y}^{-})$ for each training instance. We designate the source code $x$ as the fixed anchor point in the latent space; it serves as the semantic oracle against which all translation candidates are measured. The reference translation $y^{+}$ is the positive sample, and $\mathcal{Y}^{-}=\{y^{-}_{1},\dots,y^{-}_{K}\}$ is the set of semantically divergent negatives. The optimization goal is to pull the embedding $f(y^{+})$ into the immediate $\epsilon$-neighborhood of the anchor $f(x)$, while explicitly repelling the hard negatives in $\mathcal{Y}^{-}$ beyond this neighborhood boundary, thereby establishing a robust semantic margin against subtle deviations.

#### Training Objective\.

Our semantic reward model adopts a dual-encoder architecture, which contains a source code encoder and a target code encoder, sharing the same parameters. To effectively learn semantic similarity for cross-lingual code representation, we use the InfoNCE loss van den Oord et al. ([2018](https://arxiv.org/html/2605.13229#bib.bib42)) as the training objective:

$$\mathcal{L}_{f}=-\log\frac{\exp\left(\cos\left(f(\mathbf{x}),f(\mathbf{y}^{+})\right)/\tau\right)}{\exp\left(\cos\left(f(\mathbf{x}),f(\mathbf{y}^{+})\right)/\tau\right)+\sum\limits_{i=1}^{K}\exp\left(\cos\left(f(\mathbf{x}),f(\mathbf{y}^{-}_{i})\right)/\tau\right)},\tag{1}$$

where $\mathbf{x}$ denotes the source code, and $\mathbf{y}^{+}$ and $\mathbf{y}^{-}_{i}$ denote the reference translation and the negative translations, respectively. $f(\cdot)$ is the encoder that maps code snippets to semantic representations, $\cos(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is the temperature parameter.
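The loss in Eq. (1) can be sketched in plain Python for a single source-anchored triplet. This is an illustrative sketch only: the embeddings are passed in as plain lists of floats, the helper names `cosine` and `info_nce` are hypothetical, and the default temperature `tau=0.05` is a placeholder rather than a value reported in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau=0.05):
    """InfoNCE loss of Eq. (1) for one triplet (x, y+, Y-).

    anchor    -- embedding f(x) of the source code
    positive  -- embedding f(y+) of the reference translation
    negatives -- list of embeddings f(y_i^-) of perturbed translations
    tau       -- temperature (placeholder value, an assumption of this sketch)
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(exp(pos) / sum(exp(all))) == logsumexp(all) - pos
    return log_denominator - logits[0]
```

The loss approaches zero when the positive is much closer to the anchor than every negative, and equals $\log(K+1)$ when the positive and all $K$ negatives are equally similar to the anchor.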

#### Dataset for Training\.

The positive samples come from the reference code in the supervised dataset. To construct negative samples for training the semantic reward model, we employ an LLM as a perturbation generator. Specifically, the LLM rewrites the reference code by introducing subtle perturbations that alter the intended semantics. These perturbed variants often remain syntactically valid but are semantically flawed, making them ideal negative examples. Together with the original reference translations, they form the training dataset for learning cross-lingual semantic alignment.

#### Reward Score Function\.

Based on the semantic model, we encode both the source and candidate target code into embeddings within a shared vector space. For a set of candidate translations $y=\{y_{1},\dots,y_{n}\}$ in the target language, we compute the cosine similarity between each candidate $y_{i}$ and the source code $x$. The similarity scores are transformed into logit values and standardized via z-score normalization to produce the final semantic reward:

$$\begin{split}r_{s}(y_{i})&=\frac{s_{i}-\mu(s_{1},\dots,s_{n})}{\sigma(s_{1},\dots,s_{n})},\\ s_{i}&=\log\frac{\cos(f(x),f(y_{i}))}{1-\cos(f(x),f(y_{i}))},\end{split}\tag{2}$$

where $\mu$ and $\sigma$ represent the mean and standard deviation of the scores $s_{1},\dots,s_{n}$, respectively. This function enables list-wise scoring by capturing the semantic similarity between the source code and each candidate translation.
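A small sketch of Eq. (2) is given below. Two details are assumptions of this sketch and are not stated in the text: cosine similarities are clipped into $(0,1)$ so that the logit stays finite, and the population standard deviation is used. The function name `semantic_rewards` is hypothetical.

```python
import math
from statistics import mean, pstdev


def semantic_rewards(cos_sims, eps=1e-6):
    """Eq. (2): logit-transform cosine similarities, then z-score them.

    cos_sims -- cosine similarity cos(f(x), f(y_i)) for each candidate
    eps      -- clipping bound (an assumption of this sketch); the logit
                is only defined for similarities strictly inside (0, 1)
    """
    clipped = [min(max(c, eps), 1.0 - eps) for c in cos_sims]
    scores = [math.log(c / (1.0 - c)) for c in clipped]
    mu = mean(scores)
    sigma = pstdev(scores)  # population std; the variant is an assumption
    if sigma < 1e-12:
        # All candidates score the same: no list-wise preference signal.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]
```

Because the scores are standardized within each candidate list, the rewards are comparable across source programs whose raw similarities sit in different ranges.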

### 3.2 Preference Optimization

#### Problem Formulation\.

We formulate code translation as a multi-objective optimization problem over syntax and semantics, and apply a linear scalarization strategy to integrate the multiple objectives into a unified optimization goal. We denote the syntax objective as $g$, the semantics objective as $s$, and the preference dataset for code translation as $\mathcal{D}=\{\mathcal{D}_{g},\mathcal{D}_{s}\}$, where $\mathcal{D}_{g}$ and $\mathcal{D}_{s}$ denote the syntactic preference dataset and semantic preference dataset, respectively. Given the dataset $\mathcal{D}$, we define the oracle reward model as a weighted combination of syntax and semantic rewards:

$$\mathbf{r}^{*}(\mathbf{x},\mathbf{y})=w\cdot r^{*}_{g}(\mathbf{x},\mathbf{y})+(1-w)\cdot r^{*}_{s}(\mathbf{x},\mathbf{y}),\tag{3}$$

where $r^{*}_{g}(\mathbf{x},\mathbf{y})$ and $r^{*}_{s}(\mathbf{x},\mathbf{y})$ evaluate the syntax and semantic rewards of the translated target code, respectively. $w\in[0,1]$ is a weight parameter to balance syntactic and semantic preferences. In contrast to the classical Pareto optimization, which seeks to approximate the Pareto front of multiple objectives, we focus on optimizing the code translation model under a specific preference weighting $w=0.5$. This choice represents an equal prioritization of syntactic and semantic objectives.

This leads to the following training objective for the translation modelπθ\\pi\_\{\\theta\}:

$$\max_{\pi_{\theta}}\;\mathbb{E}_{\mathbf{x},\mathbf{y}\sim\pi(\cdot\,|\,\mathbf{x})}\left[\mathbf{r}^{*}(\mathbf{x},\mathbf{y})-\beta\log\frac{\pi_{\theta}(\mathbf{y}\,|\,\mathbf{x})}{\pi_{\text{sft}}(\mathbf{y}\,|\,\mathbf{x})}\right],\tag{4}$$

where $\pi_{\text{sft}}$ is the supervised finetuned model serving as the reference policy, and $\beta$ is a regularization coefficient controlling deviation from the reference model.

While previous works Shojaee et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib13)) have leveraged RL for reward-based finetuning, such approaches tend to be unstable and resource-intensive. We propose an RL-free variant of DPO, designed to integrate syntax and semantics into a unified training process efficiently.

#### Derivation of CTO\.

The optimal policy under the RLHF objective \(Eq\. \([4](https://arxiv.org/html/2605.13229#S3.E4)\)\) takes the form:

$$\pi^{*}_{\theta}(y\,|\,x)=\frac{1}{Z(x)}\pi_{\text{sft}}(y\,|\,x)\exp\left(\frac{\mathbf{r}^{*}(x,y)}{\beta}\right),\tag{5}$$

where the partition function is defined as $Z(x)=\sum_{y}\pi_{\text{sft}}(y\,|\,x)\exp\left(\frac{1}{\beta}\left(w\cdot r^{*}_{g}(x,y)+(1-w)\cdot r^{*}_{s}(x,y)\right)\right)$.

Let $\mathcal{D}_{k}=\{(x,y_{w},y_{l})\}$ denote the pairwise preference dataset under objective $k\in\{g,s\}$, where $y_{w}\succ y_{l}$ indicates that $y_{w}$ is preferred over $y_{l}$ for input $x$. The pairwise preference likelihood under the oracle reward model is given by:

$$p_{\mathcal{D}_{k}}(y_{w}\succ y_{l}\mid x)=\sigma\left(\mathbf{r}^{*}(x,y_{w})-\mathbf{r}^{*}(x,y_{l})\right),\tag{6}$$

where $\sigma(\cdot)$ denotes the sigmoid function.

Substituting Eq. ([5](https://arxiv.org/html/2605.13229#S3.E5)) into Eq. ([6](https://arxiv.org/html/2605.13229#S3.E6)) and approximating the ground-truth reward $r^{*}_{-k}(x,y)$ from the complementary objective $-k$ using a learned reward model $r_{-k}(x,y)$, we derive the final training objective of CTO:

$$\begin{aligned}\mathcal{L}_{\text{CTO}}=-\,\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}_{k}}\bigg[\log\sigma\bigg(&\frac{\beta}{w}\Big(\log\frac{\pi_{\theta}(y_{w}\,|\,x)}{\pi_{\text{sft}}(y_{w}\,|\,x)}-\log\frac{\pi_{\theta}(y_{l}\,|\,x)}{\pi_{\text{sft}}(y_{l}\,|\,x)}\Big)\\ &-\frac{1-w}{w}\Big(r_{-k}(x,y_{w})-r_{-k}(x,y_{l})\Big)\bigg)\bigg],\end{aligned}\tag{7}$$

where $-k$ denotes the objective complementary to $k$ (i.e., if $k=g$, then $-k=s$, and vice versa).
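The substitution step can be spelled out as follows. This is a sketch for the case $k=g$ with $w>0$, under the assumption (ours, in the spirit of multi-objective DPO, and not stated verbatim in the text) that the preference likelihood on $\mathcal{D}_{g}$ is driven by the per-objective reward $r^{*}_{g}$.

```latex
\begin{align*}
% Step 1: invert Eq. (5) to express the combined reward through the optimal policy.
\mathbf{r}^{*}(x,y) &= \beta\log\frac{\pi^{*}_{\theta}(y\,|\,x)}{\pi_{\text{sft}}(y\,|\,x)} + \beta\log Z(x), \\
% Step 2: use Eq. (3) to isolate the syntax reward (requires w > 0).
r^{*}_{g}(x,y) &= \frac{1}{w}\Big(\beta\log\frac{\pi^{*}_{\theta}(y\,|\,x)}{\pi_{\text{sft}}(y\,|\,x)} + \beta\log Z(x) - (1-w)\,r^{*}_{s}(x,y)\Big), \\
% Step 3: take the difference for a pair; the \beta\log Z(x) terms cancel.
r^{*}_{g}(x,y_{w}) - r^{*}_{g}(x,y_{l}) &= \frac{\beta}{w}\Big(\log\frac{\pi^{*}_{\theta}(y_{w}\,|\,x)}{\pi_{\text{sft}}(y_{w}\,|\,x)} - \log\frac{\pi^{*}_{\theta}(y_{l}\,|\,x)}{\pi_{\text{sft}}(y_{l}\,|\,x)}\Big) - \frac{1-w}{w}\Big(r^{*}_{s}(x,y_{w}) - r^{*}_{s}(x,y_{l})\Big).
\end{align*}
```

Placing this difference inside the sigmoid, replacing $r^{*}_{s}$ with the learned $r_{s}$, and taking the negative log-likelihood over $\mathcal{D}_{g}$ recovers the form of Eq. (7) with $-k=s$.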

This formulation reveals the core mechanism of CTO: by leveraging preference data from one objective (captured in $\mathcal{D}_{k}$) and integrating the explicit reward difference from the other ($r_{-k}$) as a rectification term, we realize the simultaneous modeling of both syntactic and semantic objectives within a unified optimization process.
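To make the rectification term concrete, here is a minimal per-pair sketch of Eq. (7) in plain Python. It takes sequence log-probabilities and complementary-objective rewards as plain floats; the function name `cto_loss` is hypothetical, `beta=0.1` is a placeholder, and only `w=0.5` comes from the text. It is an illustration of the formula, not the authors' implementation.

```python
import math


def cto_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
             r_other_w, r_other_l, beta=0.1, w=0.5):
    """Per-pair CTO loss following Eq. (7).

    logp_*     -- log pi_theta(y | x) for the preferred (w) / rejected (l) candidate
    ref_logp_* -- log pi_sft(y | x) under the frozen reference policy
    r_other_*  -- reward r_{-k} from the complementary objective
    beta       -- regularization coefficient (placeholder value, an assumption)
    w          -- objective weight; 0.5 is the weighting stated in the text
    """
    log_ratio_gap = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    margin = (1.0 - w) / w * (r_other_w - r_other_l)
    z = beta / w * log_ratio_gap - margin
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable way
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

With $w=0.5$ the log-ratio gap is scaled by $2\beta$ and the reward gap enters with weight 1. A positive gap $r_{-k}(x,y_{w})-r_{-k}(x,y_{l})$ lowers the sigmoid argument, so at the same log-ratio gap the loss is larger than in the margin-free case.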

#### Preference Dataset Construction\.

Constructing a high\-quality preference dataset is essential for effective model training\. Syntactic feedback is generally more stable and verifiable, as grammar correctness can be deterministically checked via compilation\. Such feedback serves as an oracle\-like signal that enables the reliable classification of candidate translations into preferred and less preferred sets\. In contrast, semantic feedback is inherently more ambiguous\. Although extended unit tests and semantic reward models offer approximate semantic judgments, determining the absolute correctness of a translation remains an undecidable problem\. Given this discrepancy in reliability, we build our preference dataset primarily based on syntactic correctness, where preferences between candidate translations are derived from deterministic compilation results\.

Table 1: CA@1 scores on the TransCoder-Test and HumanEval-X datasets. “C”, “J”, and “P” denote C++, Java, and Python, respectively. The best scores are marked in bold.

As shown in Figure [2](https://arxiv.org/html/2605.13229#S3.F2), we begin with a supervised dataset comprising aligned source and target code pairs. A base code LLM is finetuned on this dataset to improve its code translation capabilities. For a source code snippet $x$, we generate a set of candidate translations $\{y_{1},y_{2},\dots,y_{n}\}$ using the finetuned model. Each candidate $y_{i}$ is passed through a compiler to obtain a binary signal indicating whether the code compiles successfully. Based on the compilation results, the candidates are partitioned into a pass set and a fail set, from which we construct a preference dataset. Instead of selecting random pairs, we employ git-diff ([https://git-scm.com/docs/git-diff](https://git-scm.com/docs/git-diff)) to identify the closest matching pair $(y_{w},y_{l})$, where $y_{w}$ comes from the pass set and $y_{l}$ comes from the fail set, so that low-gap pairs capture fine-grained preferences. We then incorporate scores from the semantic reward model as soft guidance signals. Specifically, the semantic scores of $y_{w}$ and $y_{l}$ are used during optimization to enrich the syntactic preference with semantic discrimination.

This strategy leverages the determinism of syntactic preference labels while still benefiting from the fine\-grained semantic discrimination offered by the semantic reward model\. Such integration facilitates the simultaneous optimization of syntactic accuracy and semantic alignment\.
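The pair-selection step can be sketched as follows. Note the substitution: this sketch measures closeness with the line-based similarity ratio from Python's standard `difflib` module rather than with git-diff, and the function name `closest_pass_fail_pair` is hypothetical. It illustrates the idea of picking the lowest-gap pass/fail pair, not the authors' exact procedure.

```python
import difflib


def closest_pass_fail_pair(passed, failed):
    """Pick the most similar (y_w, y_l) pair across the pass and fail sets.

    passed -- candidate translations that compiled
    failed -- candidate translations that did not compile
    Similarity is difflib's line-based ratio, a stand-in for the
    git-diff based matching in the text (an assumption of this sketch).
    Returns None when either set is empty.
    """
    best_pair, best_ratio = None, -1.0
    for y_w in passed:
        for y_l in failed:
            ratio = difflib.SequenceMatcher(
                None, y_w.splitlines(), y_l.splitlines()
            ).ratio()
            if ratio > best_ratio:
                best_pair, best_ratio = (y_w, y_l), ratio
    return best_pair
```

Selecting near-identical pairs means the preferred and rejected candidates differ mostly in the lines responsible for the compilation failure, which is what lets the preference capture fine-grained differences.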

## 4 Experiments and Results

### 4.1 Experiment Settings

#### Dataset Construction\.

Our supervised finetuning training set and preference dataset construction are based on the XLCoST dataset Zhu et al. ([2022a](https://arxiv.org/html/2605.13229#bib.bib8)), which contains parallel snippet-level and program-level code for commonly used programming languages, with each sample accompanied by an example test case to verify functionality. We collect the parallel program-level data from the training and validation splits. The supervised finetuning training set consists of 6,884 Java↔C++, 6,419 C++↔Python, and 7,278 Java↔Python samples, with a validation set of 346, 323, and 376 samples for each pair, respectively. After finetuning, we generate 10 candidate responses for each input using a sampling temperature of 0.9.

For the semantic model training dataset, we employ the Qwen3\-8B model to rewrite the reference code of XLCoST, thereby generating corresponding negative samples\. Because the inference is conducted on candidates sampled from the supervised finetuning model at the semantic model’s test time, rather than on the reference data itself, this avoids data leakage and ensures benchmarks remain entirely unseen by the semantic model\.

#### Benchmarks\.

To assess the performance of our CTO, we employ two benchmark datasets:

- TransCoder-Test Roziere et al. ([2020](https://arxiv.org/html/2605.13229#bib.bib1)) is a standard benchmark for unsupervised code translation, with tasks in Java, C++, and Python (482, 467, and 464 tasks, respectively) and an average of 10 test cases per task.
- HumanEval-X Zheng et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib20)) is extended from the HumanEval benchmark Chen et al. ([2021](https://arxiv.org/html/2605.13229#bib.bib34)) and designed for cross-lingual code correctness evaluation using automated test cases. It includes 164 tasks, with 6.9 test cases on average.

#### Evaluation Metrics\.

Computational Accuracy at top\-K \(CA@K\) measures the proportion of tasks where at least one of the top\-K translations generated by the model passes all test cases\. We use CA@1, showing the model’s effectiveness in generating functionally equivalent code to the reference\.
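The metric can be sketched as below. The input layout (per-task lists of per-translation test outcomes) and the function name `computational_accuracy_at_k` are assumptions of this sketch, as is the choice to count a translation with no test outcomes (for example, one that failed to compile) as a failure.

```python
def computational_accuracy_at_k(results, k=1):
    """CA@K: fraction of tasks where any of the top-k translations passes all tests.

    results -- one entry per task; each entry is a list (ranked best-first)
               of per-translation test outcomes, where each outcome is a
               list of booleans, one per test case
    """
    if not results:
        raise ValueError("no tasks to evaluate")
    solved = 0
    for task in results:
        top_k = task[:k]
        # An empty outcome list (e.g. compilation failure) counts as a failure
        # in this sketch.
        if any(len(outcome) > 0 and all(outcome) for outcome in top_k):
            solved += 1
    return solved / len(results)
```

For CA@1, only the first-ranked translation of each task is inspected, so a task solved only by a lower-ranked candidate does not count.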

Table 2: CA@1 scores of the ablation study and alternative methods on the TransCoder-Test and HumanEval-X datasets.

#### Baseline Methods\.

We compare CTO to two categories of code translation methods:

- Unsupervised translation does not require parallel corpora for training. Instead, it leverages self-supervised learning techniques: (1) TransCoder Roziere et al. ([2020](https://arxiv.org/html/2605.13229#bib.bib1)) is an unsupervised model with approximately 100M parameters that leverages masked language modeling, denoising autoencoding, and back translation as its core pre-training tasks. (2) TransCoder-ST Roziere et al. ([2022](https://arxiv.org/html/2605.13229#bib.bib3)) extends TransCoder by incorporating automated unit tests to reduce noise in back translation, thereby improving model performance.
- Supervised translation relies on parallel corpora with aligned source and target code pairs to learn a transformation between different programming languages: (1) CodeT5-SFT is a supervised finetuned variant of CodeT5 Wang et al. ([2021](https://arxiv.org/html/2605.13229#bib.bib24)). We choose CodeT5 as the base model due to its comparable scale (770M) and architecture to TransCoder and TransCoder-ST, ensuring a balanced comparison. (2) PPOCoder Shojaee et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib13)) is built upon CodeT5-SFT with proximal policy optimization (PPO). It incorporates compiler feedback and both syntactic and semantic match scores as reward signals to improve translation quality. (3) CodeLlama-7B-SFT is a supervised finetuned variant of CodeLlama-7B Rozière et al. ([2023](https://arxiv.org/html/2605.13229#bib.bib28)), which serves as a representative foundational model for code-related tasks. (4) Qwen2.5-Coder-7B-SFT is a supervised finetuned variant of Qwen2.5-Coder-7B Hui et al. ([2024](https://arxiv.org/html/2605.13229#bib.bib50)).

#### Implementation\.

For supervised finetuning, CodeT5 uses a learning rate of 5e\-5 and a batch size of 128, while CodeLlama\-7B and Qwen2\.5\-Coder\-7B both use a learning rate of 3e\-5 and a batch size of 32, with gradient accumulation to fit within memory constraints\. For the semantic reward model, we employ Qwen3\-Embedding 0\.6B as the base model and train it with a learning rate of 6e\-6 and a batch size of 32 for 5 epochs\. For preference optimization, we set the learning rate to 5e\-7 for CodeT5\-SFT and 5e\-5 for both CodeLlama\-7B\-SFT and Qwen2\.5\-Coder\-7B\-SFT\. For CodeLlama\-7B and Qwen2\.5\-Coder\-7B, we apply LoRA Hu et al\. \([2022](https://arxiv.org/html/2605.13229#bib.bib29)\) in all finetuning processes, including both the supervised finetuning and preference optimization phases\.

All experiments are conducted on one NVIDIA RTX A800 GPU and Ubuntu 20\.04 LTS\. See our source code at [https://github\.com/nju\-websoft/CTO](https://github.com/nju-websoft/CTO) for other implementation details\.

### 4\.2Code Translation Results

Table [1](https://arxiv.org/html/2605.13229#S3.T1) shows the experimental results on the TransCoder\-Test and HumanEval\-X datasets\. Our CTO consistently outperforms existing methods across six translation tasks\.

Compared with the unsupervised code translation methods, CTO \(CodeT5\) surpasses both TransCoder and TransCoder\-ST across all evaluated scenarios\. This performance gap underscores the efficacy of supervised finetuning on parallel corpora, which facilitates the direct acquisition of cross\-lingual mappings and ensures higher translation reliability\. In contrast, unsupervised methods rely on back translation, which can introduce noise and misalignment\. For PPOCoder, we follow its reward function and parameter settings in Shojaee et al\. \([2023](https://arxiv.org/html/2605.13229#bib.bib13)\) and apply PPO on our finetuned CodeT5 model\. Contrary to our expectations, PPOCoder exhibits performance degradation in most cases\. This implies that PPOCoder does not always benefit from its reward function and may even introduce instability to the finetuned model\.

Since the preference dataset is derived from supervised finetuned models, its quality depends on the model’s ability to generate valid compilable samples\. However, CodeT5\-SFT generates fewer valid samples in translation tasks from Python, making preference data collection more difficult\. This observation is consistent with previous studies Feng et al\. \([2024](https://arxiv.org/html/2605.13229#bib.bib26)\), which have noted a strong correlation between the performance of supervised finetuned models and the effectiveness of preference optimization\.
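The dependence on compilable samples can be made concrete with a small sketch. It assumes, purely for illustration, that syntax preference pairs are formed by pairing each compilable candidate (chosen) with each non-compilable one (rejected); the pairing scheme, the helper names, and the use of Python's built-in `compile()` as a stand-in for a C++ or Java compiler are assumptions, not the paper's implementation.

```python
from itertools import product
from typing import List, Tuple


def compiles(source: str) -> bool:
    """Stand-in syntax check: Python's built-in compile() plays the target compiler."""
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def syntax_preference_pairs(candidates: List[str]) -> List[Tuple[str, str]]:
    """Pair every compilable candidate (chosen) with every non-compilable one (rejected)."""
    valid = [c for c in candidates if compiles(c)]
    invalid = [c for c in candidates if not compiles(c)]
    return list(product(valid, invalid))


# Hypothetical sampled translations for one task.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b) return a + b\n",      # missing colon
    "def add(a, b):\n    return a +\n",  # dangling operator
]
pairs = syntax_preference_pairs(samples)
print(len(pairs))

# When no candidate compiles, no pair can be built for this task.
no_valid = syntax_preference_pairs(samples[1:])
print(len(no_valid))
```

Under this assumed scheme, a task for which the finetuned model produces no compilable candidate contributes no preference pair at all, which is one way weak supervised finetuning could shrink the preference dataset.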

To assess the scalability of CTO on larger code models, we conduct experiments using decoder\-only LLMs, CodeLlama\-7B and Qwen2\.5\-Coder\-7B\. Across six language pairs and two benchmarks, CTO consistently outperforms their supervised finetuned counterparts \(CodeLlama\-7B\-SFT and Qwen2\.5\-Coder\-7B\-SFT\)\. These improvements validate the generalization of CTO across different model architectures and parameter sizes\.

### 4\.3Ablation Study and Alternative Methods

Table 3: Comparison of CA@1 scores between reference\-based metrics and the semantic reward model\.

To investigate the effectiveness of multi\-objective preference optimization in our proposed method, we conduct an ablation study that removes one preference at a time and measures the impact\. We design two variants:

- CTO w/o semantic removes the semantic preference from the preference optimization stage, meaning the model is optimized solely based on the syntax preference data\. Consequently, CTO degenerates into DPO, which helps isolate the impact of semantic reward\.
- CTO w/o syntax removes the syntax preference from the preference optimization stage, i\.e\., the model is optimized solely based on the semantic preference data\. In this setting, CTO degenerates to vanilla DPO, where the reference translation is treated as the chosen sample and the rejected samples are standard outputs that do not match the reference\.

As shown in Table [2](https://arxiv.org/html/2605.13229#S4.T2), each preference contributes significantly to the final performance\. Notably, in the w/o syntax variant, we rely on the example unit tests shipped with the original dataset as auxiliary signals to guide optimization, yet the performance still falls short of CTO\. These results confirm that multi\-objective preference optimization plays a crucial role in maximizing performance\.

Additionally, to further validate the effectiveness of our preference optimization strategy, we compare our method with two typical reward\-free preference optimization techniques: identity preference optimization \(IPO\) Azar et al\. \([2024](https://arxiv.org/html/2605.13229#bib.bib21)\) and simple preference optimization \(SimPO\) Meng et al\. \([2024](https://arxiv.org/html/2605.13229#bib.bib22)\)\. These baselines perform preference optimization on supervised finetuned models, enabling us to assess whether CTO provides a tangible advantage over existing techniques\. As also presented in Table [2](https://arxiv.org/html/2605.13229#S4.T2), CTO achieves higher performance across most translation tasks on both the TransCoder\-Test and HumanEval\-X benchmarks\. IPO and SimPO not only fail to match the performance of our method, but in certain scenarios they even degrade the performance of the supervised finetuned models they are applied to\.
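For reference, the sketch below writes out the standard per-pair losses of DPO, IPO, and SimPO as they are commonly stated in their original papers. It is not CTO's multi-objective objective, which is defined elsewhere in the paper. The function names and the default values of `beta`, `tau`, and `gamma` are illustrative assumptions; the inputs are summed sequence log-probabilities of the chosen (`w`) and rejected (`l`) responses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """DPO: logistic loss on the policy/reference log-ratio margin."""
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return -log_sigmoid(beta * margin)


def ipo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float,
             tau: float = 0.1) -> float:
    """IPO: squared distance between the same margin and the target 1 / (2 * tau)."""
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(pi_w: float, pi_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalized log-probabilities, no reference model, target margin gamma."""
    margin = beta * (pi_w / len_w - pi_l / len_l) - gamma
    return -log_sigmoid(margin)


# Hypothetical log-probabilities for one preference pair.
print(dpo_loss(pi_w=-12.0, pi_l=-15.0, ref_w=-13.0, ref_l=-14.0))
print(ipo_loss(pi_w=-12.0, pi_l=-15.0, ref_w=-13.0, ref_l=-14.0))
print(simpo_loss(pi_w=-10.0, pi_l=-20.0, len_w=10, len_l=10))
```

All three losses only consume a chosen/rejected pair, so they differ in how the margin is shaped rather than in where the preference signal comes from.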

Overall, these results demonstrate that our CTO effectively integrates syntactic and semantic modeling within a preference optimization framework, leading to superior performance in code translation tasks\. These findings reinforce the importance of multi\-objective preference optimization as a robust strategy for enhancing code translation models\.

### 4\.4Reward Model Evaluation

![Refer to caption](https://arxiv.org/html/2605.13229v1/x3.png)

Figure 3: Distribution of negative sample types\.

#### Negative Sample Types\.

To evaluate the quality and diversity of our semantic reward model’s training data, we analyze the distribution of various negative sample types, following the semantic error classification outlined in Pan et al\. \([2024](https://arxiv.org/html/2605.13229#bib.bib2)\)\. As shown in Figure [3](https://arxiv.org/html/2605.13229#S4.F3), the categorization reveals that logical errors constitute the largest portion of the dataset\. The extensive coverage of error categories, coupled with their relatively balanced representation, prevents the model from being biased toward specific patterns and ensures robust detection of various semantic discrepancies\.

#### Latent Space Visualization\.

To assess the effectiveness of our trained semantic reward model, we visualize the source and target code embeddings that it produces in a shared vector space\. Figure [4](https://arxiv.org/html/2605.13229#S4.F4) shows that embeddings from different programming languages form coherent clusters instead of being strictly segregated by language\. This verifies the model’s ability to fuse cross\-lingual code semantics into a unified embedding space, suggesting that it captures meaningful alignment between source and target language structures\.
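One way to read a shared embedding space as a reward signal is sketched below. It assumes, as an illustration only, that the semantic score of a candidate translation is the cosine similarity between its embedding and the source embedding; the scoring rule, the function names, and the toy 3-dimensional vectors are hypothetical and do not come from the paper's code.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_candidates(source_emb: Sequence[float],
                    candidate_embs: List[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the source."""
    scores = [cosine_similarity(source_emb, c) for c in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for encoder outputs.
source = [1.0, 0.0, 1.0]
candidates = [
    [0.9, 0.1, 1.0],   # close to the source direction
    [-1.0, 0.0, 0.2],  # points away from the source
    [0.0, 1.0, 0.0],   # orthogonal to the source
]
print(rank_candidates(source, candidates))
```

Because the score is computed against the source code rather than a reference translation, a ranking of this kind needs neither test cases nor a gold target program.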

#### Effectiveness of the Semantic Reward Model\.

Lastly, we compare the performance of our semantic reward model with previous reference\-based metrics such as CodeBLEU Ren et al\. \([2020](https://arxiv.org/html/2605.13229#bib.bib45)\) and CodeBERTScore Zhou et al\. \([2023](https://arxiv.org/html/2605.13229#bib.bib49)\) when each is used as the semantic reward in CTO\. As presented in Table [3](https://arxiv.org/html/2605.13229#S4.T3), our semantic reward model outperforms both CodeBLEU and CodeBERTScore in most scenarios\. This performance gap highlights the advantage of leveraging our learned semantic model over relying solely on reference\-based metrics as the semantic reward\.

Overall, these results confirm the cross\-lingual semantic coherence captured by our semantic reward model and its superior effectiveness over prior reference\-based metrics\.

![Refer to caption](https://arxiv.org/html/2605.13229v1/x4.png)

Figure 4: t\-SNE representation of source language code \(yellow\) and target language code \(green\)\.

## 5Conclusion

In this paper, we present CTO to improve code translation by syntax\-guided and semantic\-aware preference optimization\. We formulate code translation as a multi\-objective preference optimization problem\. By leveraging compiler feedback and semantic reward modeling, CTO effectively refines translation quality, ensuring both syntactic correctness and semantic alignment\. Experimental results across multiple benchmarks demonstrate that CTO significantly outperforms state\-of\-the\-art methods in code translation accuracy and robustness\. The proposed preference optimization approach is also generalizable, making it adaptable to various programming languages and model architectures\. In future work, we will extend CTO to support a broader range of programming languages and explore its effectiveness in other software migration scenarios\.

## Acknowledgments

This work was supported by the “111 Center” \(No\. B26023\), the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China \(No\. JYB2025XDXM118\), and the Cooperation Fund of Huawei Cooperation Project \(No\. TC20230202021\-2024\-12\)\.

## References

- M\. G\. Azar, Z\. D\. Guo, B\. Piot, R\. Munos, M\. Rowland, M\. Valko, and D\. Calandriello \(2024\) A general theoretical paradigm to understand learning from human preferences\. In PMLR, Vol\. 238, Palau de Congresos, VLC, Spain, pp\. 4447–4455\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. CoRR 2107\.03374, pp\. 1–35\.
- Y\. Du, H\. Sun, and M\. Li \(2024\) A joint learning model with variational interaction for multilingual program translation\. In ASE, Sacramento, CA, USA, pp\. 1907–1918\.
- D\. Feng, B\. Qin, C\. Huang, Z\. Zhang, and W\. Lei \(2024\) Towards analyzing and understanding the limitations of DPO: a theoretical perspective\. CoRR 2404\.04626, pp\. 1–8\.
- Z\. Feng, D\. Guo, D\. Tang, N\. Duan, X\. Feng, M\. Gong, L\. Shou, B\. Qin, T\. Liu, D\. Jiang, et al\. \(2020\) CodeBERT: a pre\-trained model for programming and natural languages\. In EMNLP, Virtual, pp\. 1536–1547\.
- L\. Gee, M\. Gritta, G\. Lampouras, and I\. Iacobacci \(2025\) Code\-Optimise: self\-generated preference data for correctness and efficiency\. In Findings of NAACL, Albuquerque, NM, USA, pp\. 79–94\.
- D\. Guo, S\. Ren, S\. Lu, Z\. Feng, D\. Tang, S\. Liu, L\. Zhou, N\. Duan, A\. Svyatkovskiy, S\. Fu, et al\. \(2021\) GraphCodeBERT: pre\-training code representations with data flow\. In ICLR, Virtual, pp\. 1–18\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In ICLR, Virtual, pp\. 1–26\.
- Y\. Huang, M\. Qi, Y\. Yao, M\. Wang, B\. Gu, C\. Clement, and N\. Sundaresan \(2023\) Program translation via code distillation\. In EMNLP, Singapore, pp\. 10903–10914\.
- B\. Hui, J\. Yang, Z\. Cui, J\. Yang, D\. Liu, L\. Zhang, T\. Liu, J\. Zhang, B\. Yu, K\. Lu, et al\. \(2024\) Qwen2\.5\-coder technical report\. CoRR 2409\.12186, pp\. 1–32\.
- A\. R\. Ibrahimzada, K\. Ke, M\. Pawagi, M\. S\. Abid, R\. Pan, S\. Sinha, and R\. Jabbarvand \(2024\) Repository\-level compositional code translation and validation\. CoRR 2410\.24117, pp\. 1–22\.
- M\. A\. M\. Khan, M\. S\. Bari, X\. D\. Long, W\. Wang, Md\. R\. Parvez, and S\. Joty \(2024\) XCodeEval: an execution\-based large scale multilingual multitask benchmark for code understanding, generation, translation and retrieval\. In ACL, Bangkok, Thailand, pp\. 6766–6805\.
- H\. Le, Y\. Wang, A\. D\. Gotmare, S\. Savarese, and S\. C\. H\. Hoi \(2022\) CodeRL: mastering code generation through pretrained models and deep reinforcement learning\. In NeurIPS, New Orleans, LA, USA, pp\. 21314–21328\.
- F\. Liu, J\. Li, and L\. Zhang \(2023\) Syntax and domain aware model for unsupervised program translation\. In ICSE, Melbourne, VIC, Australia, pp\. 755–767\.
- S\. Lu, D\. Guo, S\. Ren, J\. Huang, A\. Svyatkovskiy, A\. Blanco, C\. B\. Clement, D\. Drain, D\. Jiang, D\. Tang, et al\. \(2021\) CodeXGLUE: a machine learning benchmark dataset for code understanding and generation\. In NeurIPS, Virtual, pp\. 1–16\.
- Z\. Ma, X\. Zhang, J\. Zhang, J\. Yu, S\. Luo, and J\. Tang \(2025\) Dynamic scaling of unit tests for code reward modeling\. In ACL, Vienna, Austria\.
- Y\. Meng, M\. Xia, and D\. Chen \(2024\) SimPO: simple preference optimization with a reference\-free reward\. In NeurIPS, Vancouver, BC, Canada, pp\. 1–38\.
- A\. T\. Nguyen, T\. T\. Nguyen, and T\. N\. Nguyen \(2014\) Migrating code with statistical machine translation\. In ICSE Companion, Hyderabad, India, pp\. 544–547\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, et al\. \(2022\) Training language models to follow instructions with human feedback\. In NeurIPS, New Orleans, LA, USA, pp\. 27730–27744\.
- R\. Pan, A\. R\. Ibrahimzada, R\. Krishna, D\. Sankar, L\. P\. Wassi, M\. Merler, B\. Sobolev, R\. Pavuluri, S\. Sinha, and R\. Jabbarvand \(2024\) Lost in translation: a study of bugs introduced by large language models while translating code\. In ICSE, Lisbon, Portugal, pp\. 1–13\.
- R\. Park, R\. Rafailov, S\. Ermon, and C\. Finn \(2024\) Disentangling length from quality in direct preference optimization\. In ACL, Bangkok, Thailand, pp\. 4998–5017\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\) Direct preference optimization: your language model is secretly a reward model\. In NeurIPS, Vol\. 36, New Orleans, LA, USA, pp\. 53728–53741\.
- S\. Ren, D\. Guo, S\. Lu, L\. Zhou, S\. Liu, D\. Tang, N\. Sundaresan, M\. Zhou, A\. Blanco, and S\. Ma \(2020\) CodeBLEU: a method for automatic evaluation of code synthesis\. CoRR 2009\.10297, pp\. 1–8\.
- B\. Rozière, J\. Gehring, F\. Gloeckle, S\. Sootla, I\. Gat, X\. E\. Tan, Y\. Adi, J\. Liu, T\. Remez, J\. Rapin, et al\. \(2023\) Code Llama: open foundation models for code\. CoRR 2308\.12950, pp\. 1–48\.
- B\. Roziere, M\. Lachaux, L\. Chanussot, and G\. Lample \(2020\) Unsupervised translation of programming languages\. In NeurIPS, Vancouver, BC, Canada, pp\. 20601–20611\.
- B\. Roziere, J\. Zhang, F\. Charton, M\. Harman, G\. Synnaeve, and G\. Lample \(2022\) Leveraging automated unit tests for unsupervised code translation\. In ICLR, Virtual, pp\. 1–20\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. CoRR 1707\.06347, pp\. 1–12\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. CoRR 2402\.03300, pp\. 1–30\.
- P\. Shojaee, A\. Jain, S\. Tipirneni, and C\. K\. Reddy \(2023\) Execution\-based code generation using deep reinforcement learning\. Trans\. Mach\. Learn\. Res\. 2023, pp\. 1–26\.
- M\. Szafraniec, B\. Roziere, H\. Leather, F\. Charton, P\. Labatut, and G\. Synnaeve \(2023\) Code translation with compiler representations\. In ICLR, Kigali, Rwanda, pp\. 1–20\.
- A\. van den Oord, Y\. Li, and O\. Vinyals \(2018\) Representation learning with contrastive predictive coding\. CoRR 1807\.03748, pp\. 1–13\.
- Y\. Wang, Y\. Wang, S\. Wang, D\. Guo, J\. Chen, J\. Grundy, X\. Liu, Y\. Ma, M\. Mao, H\. Zhang, et al\. \(2024\) RepoTransBench: a real\-world benchmark for repository\-level code translation\. CoRR 2412\.17744, pp\. 1–23\.
- Y\. Wang, W\. Wang, S\. Joty, and S\. C\. H\. Hoi \(2021\) CodeT5: identifier\-aware unified pre\-trained encoder\-decoder models for code understanding and generation\. In EMNLP, Punta Cana, Dominican Republic, pp\. 8696–8708\.
- W\. Yan, H\. Liu, Y\. Wang, Y\. Li, Q\. Chen, W\. Wang, T\. Lin, W\. Zhao, L\. Zhu, H\. Sundaram, et al\. \(2024\) CodeScope: an execution\-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation\. In ACL, Bangkok, Thailand, pp\. 5511–5558\.
- W\. Yan, Y\. Tian, Y\. Li, Q\. Chen, and W\. Wang \(2023\) CodeTransOcean: a comprehensive multilingual benchmark for code translation\. In EMNLP, Singapore, pp\. 5067–5089\.
- Z\. Yang, F\. Liu, Z\. Yu, J\. W\. Keung, J\. Li, S\. Liu, Y\. Hong, X\. Ma, Z\. Jin, and G\. Li \(2024\) Exploring and unleashing the power of large language models in automated code translation\. Proc\. ACM Softw\. Eng\. 1, pp\. 1585–1608\.
- K\. Zhang, G\. Li, Y\. Dong, J\. Xu, J\. Zhang, J\. Su, Y\. Liu, and Z\. Jin \(2025\) CodeDPO: aligning code models with self generated and verified source code\. In ACL, Vienna, Austria\.
- Q\. Zheng, X\. Xia, X\. Zou, Y\. Dong, S\. Wang, Y\. Xue, L\. Shen, Z\. Wang, A\. Wang, Y\. Li, et al\. \(2023\) CodeGeeX: a pre\-trained model for code generation with multilingual benchmarking on HumanEval\-X\. In KDD, Long Beach, CA, USA, pp\. 5673–5684\.
- S\. Zhou, U\. Alon, S\. Agarwal, and G\. Neubig \(2023\) CodeBERTScore: evaluating code generation with pretrained models of code\. In EMNLP, Singapore, pp\. 13921–13937\.
- M\. Zhu, A\. Jain, K\. Suresh, R\. Ravindran, S\. Tipirneni, and C\. K\. Reddy \(2022a\) XLCoST: a benchmark dataset for cross\-lingual code intelligence\. CoRR 2206\.08474, pp\. 1–20\.
- M\. Zhu, K\. Suresh, and C\. K\. Reddy \(2022b\) Multilingual code snippets training for program translation\. In AAAI, Vol\. 36, Virtual, pp\. 11783–11790\.
