CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

arXiv cs.CL 04/20/26, 04:00 AM Papers

Summary

CLewR introduces a curriculum learning strategy with restarts for improving the machine translation performance of LLMs through preference optimization. The method addresses catastrophic forgetting by repeating the easy-to-hard curriculum multiple times during training, showing consistent gains across Gemma2, Qwen2.5, and Llama3.1 models.

arXiv:2601.05858v2 Announce Type: replace


# Curriculum Learning with Restarts for Machine Translation Preference Learning
Source: https://arxiv.org/html/2601.05858
Alexandra Dragomir1, Florin Brad1, Radu Tudor Ionescu2,⋄ 1Bitdefender, Bucharest, Romania 2Department of Computer Science, University of Bucharest, Bucharest, Romania ⋄raducu.ionescu@gmail.com

###### Abstract

Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.


## 1 Introduction

Large language models (LLMs) have enabled zero-shot approaches in multilingual machine translation (MT) (Touvron et al., 2023). Methods for improving the MT abilities of LLMs can be broadly categorized into pre-training and post-training approaches. The former typically employ continual pre-training over large-scale monolingual or high-quality parallel data (Alves et al., 2024; Cui et al., 2025; Xu et al., 2024a). In contrast, post-training approaches aim to improve translation quality by employing preference optimization techniques, such as Direct Preference Optimization (DPO) (Rafailov et al., 2023), to distinguish high-quality translations from low-quality ones. Building on this line of work, Xu et al. (2024b) proposed Contrastive Preference Optimization (CPO), a reference-free technique that assesses pair distances based on log-probability differences only. More recently, Xu et al. (2025) introduced Adaptive Rejection Preference Optimization (ARPO), which further improves CPO by incorporating an adaptive penalty for the unpreferred term.

Despite the significant advances in preference optimization (PO) techniques (Rafailov et al., 2023; Xu et al., 2025, 2024b), a key factor that can significantly influence performance remains underexplored: the order in which data samples are processed during training. This aspect is central to *curriculum learning* (Bengio et al., 2009), a paradigm that studies how models can learn from easy to hard. The survey of Soviany et al. (2022) explains that easy-to-hard learning can be created by manipulating distinct factors, namely the data (Chang et al., 2021; Jarca et al., 2024; Nagatsuka et al., 2023), the model (Croitoru et al., 2025b; Sinha et al., 2020) or the target task (Liu et al., 2020a; Narvekar et al., 2016). Organizing the samples in a certain order falls under the umbrella of *data-level curriculum*. In the realm of data-level curriculum, researchers have explored both easy-to-hard and hard-to-easy data organizations, the latter being known as anti-curriculum (Ankner et al., 2024; Florensa et al., 2017; Jarca et al., 2025). Regardless of the data organization, several recent studies showed that curriculum learning can play an important role in various tasks, e.g. natural language inference (Poesina et al., 2024), intent detection (Gong et al., 2021), question answering (Liu et al., 2018), image classification (Liu et al., 2022), model pre-training (Madan et al., 2024; Nagatsuka et al., 2023), etc. Curriculum learning has also been applied in neural machine translation (NMT) (Kocmi and Bojar, 2017; Liu et al., 2020b; Platanios et al., 2019; Zhan et al., 2021), but contributions in this area predate the era of LLMs, making them hard or impossible to adapt to the novel "pre-training then fine-tuning" paradigm. With the emergence of preference optimization techniques applied during the fine-tuning stage (Rafailov et al., 2023; Xu et al., 2025, 2024b), some recent works (Croitoru et al., 2025a; Pattnaik et al., 2024) have integrated curriculum learning into DPO. 
However, such techniques do not explicitly address catastrophic forgetting (Kirkpatrick et al., 2017), a problem that occurs when samples learned at the beginning are forgotten by the model by the end of the training process, eventually degrading performance.

To this end, we propose a novel data-level curriculum learning framework for MT, in which the easy-to-hard training is restarted at every epoch. Our curriculum learning strategy with restarts (CLewR) is natively designed to mitigate catastrophic forgetting by iterating through all samples in every training epoch. We empirically demonstrate that CLewR leads to consistent performance gains in MT across several state-of-the-art preference optimization methods (DPO, CPO, ARPO) and LLM families (Gemma2, Qwen2.5, Llama3.1). Our results show that CLewR not only enhances highly competitive preference optimization methods, but also surpasses another competitor based on curriculum learning, namely CurriDPO (Pattnaik et al., 2024).

In summary, our contribution is threefold:

- We propose curriculum learning with restarts (CLewR), a novel method for preference optimization in MT, where the easy-to-hard curriculum is restarted at every epoch to avoid catastrophic forgetting.
- While previous work enhanced DPO with curriculum (Pattnaik et al., 2024), we introduce curriculum learning to newer preference optimization algorithms, namely CPO and ARPO.
- We show that our method outperforms competing curriculum approaches and consistently improves performance across multiple model families (Gemma2, Llama3.1, Qwen2.5) and preference optimization algorithms (DPO, CPO, ARPO).

## 2 Method

**CLewR.** We propose a data-level curriculum strategy, named CLewR, which is tailored to preference optimization in MT. We formally present our curriculum strategy in Algorithm 1. Training preference triplets of the form (x, y_w, y_l) are ordered (in step 9) based on a similarity score s(y_w, y_l) between the chosen (winning) y_w and rejected (losing) y_l translations. More precisely, the easiness of a pair of translations is defined as the similarity difference between the preferred and rejected translation, i.e. a high difference corresponds to an easy pair and a low difference to a hard pair. The similarity score is given by the average (step 7) of multiple MT metrics, thus making CLewR suitable for translation: BLEU (Papineni et al., 2002) (step 3), COMET-22 (Rei et al., 2022) (step 4), and METEOR (Banerjee and Lavie, 2005) (step 5) scores.

We emphasize that our method implicitly works with multiple correct reference translations for a given source sentence. By default, preference optimization techniques work with triplets of the form (x, y_w, y_l). If the dataset includes k preferred outputs for the same input, we can build k preference optimization triplets. Then, CLewR can simply apply PO starting from the easier tuples (references that are most dissimilar to the rejected sample) to the more difficult tuples.
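The expansion of k references into k triplets can be sketched as follows. This is a minimal illustration, not the authors' code: `build_triplets`, `toy_similarity` and `order_easy_to_hard` are hypothetical names, and the token-overlap similarity is only a stand-in for the averaged MT metrics described above.

```python
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (source x, chosen y_w, rejected y_l)


def build_triplets(source: str, references: List[str], rejected: str) -> List[Triplet]:
    """Pair each of the k preferred references with the same rejected output."""
    return [(source, ref, rejected) for ref in references]


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for s(y_w, y_l): Jaccard overlap of lowercased tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def order_easy_to_hard(triplets: List[Triplet]) -> List[Triplet]:
    """Least similar (easiest) pairs first, most similar (hardest) last."""
    return sorted(triplets, key=lambda t: toy_similarity(t[1], t[2]))


triplets = build_triplets("Bună dimineața!", ["Good morning!", "Morning!"], "Good night!")
ordered = order_easy_to_hard(triplets)
```

In this toy case, the reference that shares no tokens with the rejected output is placed first, matching the idea that the most dissimilar reference forms the easiest tuple.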

**Algorithm 1 CLewR Preference Optimization**

1: Input: initial policy π_θ, training triplets {(x_i, y_i^w, y_i^l)}_{i=1}^{N}, learning rate μ.
2: for i = 1 to N do
3:     b ← BLEU(y_i^w, y_i^l)
4:     c ← COMET(y_i^w, y_i^l)
5:     m ← METEOR(y_i^w, y_i^l)
6:     b̂, ĉ, m̂ ← normalize_(0,1)(b, c, m)
7:     s_i ← (b̂ + ĉ + m̂) / 3
8: end for
9: I ← argsort_↑({s_i}_{i=1}^{N})
10: for epoch = 1 to E do
11:     for all batches B ⊂ I do
12:         L_PO ← loss(x_B, y_B^w, y_B^l, π_θ)
13:         θ ← optimize(θ, μ, ∇_θ L_PO)
14:     end for
15: end for
16: Output: optimized model π_θ

We provide examples of easy and hard preference samples in Appendix A.5. After sorting the triplets, training proceeds over a number of epochs (steps 10-15). At every epoch, the samples are divided into mini-batches (step 11) in the exact order established at step 9, i.e. there is no random shuffling involved. The easy-to-hard data permutation is reused at every epoch, which helps mitigate catastrophic forgetting. Learning is performed via a given PO method (steps 12-13). Note that fixing the order of samples in each epoch does not imply overfitting the order, i.e. the empirical risk does not depend on sample order. On the contrary, curriculum learning theory (Bengio et al., 2009) suggests that organizing the samples in a meaningful order can improve training dynamics, potentially leading to faster convergence and/or better optima.
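The scoring, sorting and restart steps of Algorithm 1 can be sketched in plain Python. This is an illustration under stated assumptions rather than the released implementation: the metric values are made-up inputs, min-max scaling over the dataset is one possible reading of the normalization in step 6, and `min_max`, `curriculum_order` and `epoch_batches` are hypothetical helper names. The PO update of steps 12-13 is left out.

```python
from typing import Iterator, List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1] (assumed form of the normalization in step 6)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def curriculum_order(bleu: Sequence[float], comet: Sequence[float],
                     meteor: Sequence[float]) -> List[int]:
    """Steps 2-9: average the normalized metrics, then argsort ascending."""
    b, c, m = min_max(bleu), min_max(comet), min_max(meteor)
    scores = [(bi + ci + mi) / 3.0 for bi, ci, mi in zip(b, c, m)]
    # Low similarity between chosen and rejected means an easy pair, so it comes first.
    return sorted(range(len(scores)), key=scores.__getitem__)


def epoch_batches(order: List[int], batch_size: int, num_epochs: int) -> Iterator[List[int]]:
    """Steps 10-15: reuse the same easy-to-hard order in every epoch, no shuffling."""
    for _ in range(num_epochs):
        for start in range(0, len(order), batch_size):
            yield order[start:start + batch_size]


# Made-up similarity scores between chosen and rejected translations.
bleu = [80.0, 10.0, 45.0, 30.0]
comet = [0.90, 0.40, 0.70, 0.55]
meteor = [0.85, 0.20, 0.60, 0.35]

order = curriculum_order(bleu, comet, meteor)
batches = list(epoch_batches(order, batch_size=2, num_epochs=2))
```

With these inputs the order is `[1, 3, 2, 0]`, and the second epoch replays exactly the same batches as the first, which is the restart behaviour.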

**CLewR-z.** For ARPO (Xu et al., 2025), we develop an alternative CLewR implementation called CLewR-z, where the curriculum score s (used in step 7 of Algorithm 1) is derived from the ARPO distance z. The ARPO objective modifies the CPO objective to incorporate an adaptive penalty term τ_θ, which controls the importance of the rejected term y_l:

$$\mathcal{L}_{\text{ARPO}} = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}} \Big[\log\sigma\Big(\beta\log\pi_\theta(y_w|x) - \tau_\theta(y_w,y_l)\beta\log\pi_\theta(y_l|x)\Big) + \log\pi_\theta(y_w|x)\Big]$$

The value of τ_θ measures the similarity between y_w and y_l, ranging from 0 to 1:

$$\tau_\theta(y_w,y_l) = \min(e^{\eta \cdot z_\theta(y_w,y_l)} - 1, 1)$$

where η is a hyperparameter that controls the impact of z_θ, and z_θ(y_w, y_l) encodes the distance between the chosen and rejected responses by measuring the absolute difference in log-likelihoods:

$$z_\theta(y_w,y_l) = \left|\frac{\log(\pi_\theta(y_w|x))}{|y_w|} - \frac{\log(\pi_\theta(y_l|x))}{|y_l|}\right|$$

For curriculum learning, we employ s = -z_θ in step 7 of Algorithm 1. This version is called CLewR-z.
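The quantities above can be traced numerically with a short sketch. It assumes sequence log-probabilities and token lengths are already available, and that z_θ is computed once with the initial policy before sorting (the text does not say when it is computed). All function names are hypothetical, and the per-sample loss simply transcribes the ARPO objective given above.

```python
import math
from typing import List, Tuple

# (log pi(y_w|x), |y_w|, log pi(y_l|x), |y_l|) for one triplet
Sample = Tuple[float, int, float, int]


def arpo_distance(logp_w: float, len_w: int, logp_l: float, len_l: int) -> float:
    """z_theta: absolute difference of length-normalized log-likelihoods."""
    return abs(logp_w / len_w - logp_l / len_l)


def arpo_penalty(z: float, eta: float) -> float:
    """tau_theta = min(exp(eta * z) - 1, 1)."""
    return min(math.exp(eta * z) - 1.0, 1.0)


def log_sigmoid(u: float) -> float:
    """Numerically stable log(sigmoid(u))."""
    return -math.log1p(math.exp(-u)) if u >= 0 else u - math.log1p(math.exp(u))


def arpo_loss(sample: Sample, beta: float, eta: float) -> float:
    """Per-sample ARPO loss (the expectation over the dataset is omitted)."""
    logp_w, len_w, logp_l, len_l = sample
    tau = arpo_penalty(arpo_distance(logp_w, len_w, logp_l, len_l), eta)
    margin = beta * logp_w - tau * beta * logp_l
    return -(log_sigmoid(margin) + logp_w)


def clewr_z_order(samples: List[Sample]) -> List[int]:
    """Sort ascending by s = -z, so the most distant (easiest) pairs come first."""
    scores = [-arpo_distance(*s) for s in samples]
    return sorted(range(len(samples)), key=scores.__getitem__)


samples = [(-10.0, 10, -30.0, 10), (-10.0, 10, -12.0, 10), (-10.0, 10, -20.0, 10)]
order = clewr_z_order(samples)
loss = arpo_loss(samples[0], beta=0.1, eta=0.1)
```

Here the distances are 2.0, 0.2 and 1.0, so the resulting order is `[0, 2, 1]`.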

**Enhanced ARPO.** We further introduce an enhanced variant of ARPO by using a different distance function z'_θ(y_w, y_l) that also accounts for distances in the evaluation metric spaces. Specifically, we use:

$$z' = \eta_1 \cdot z_\theta + \eta_2 \cdot z_{\text{BLEU}} + \eta_3 \cdot z_{\text{COMET}}$$

where z_θ is the original distance. For both metrics, z_metric is given by 1 - metric/100, which normalizes each metric to the (0,1) interval and gives it the same monotonicity as the original z_θ: two dissimilar predictions result in low BLEU and COMET values, so z_BLEU and z_COMET will be high. Each z is multiplied by a corresponding scalar η, which scales the terms to similar intervals. We create multiple versions of enhanced ARPO by varying the scalars η_1, η_2 and η_3. All ARPO versions are listed in Table 6.
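As a small worked example of the combined distance, the sketch below assumes BLEU and COMET are reported on a 0-100 scale, as implied by the metric/100 normalization. The η values are placeholders chosen for illustration only (the actual settings are those in Table 6), and the function names are hypothetical.

```python
def metric_distance(metric_score: float) -> float:
    """z_metric = 1 - metric/100, assuming the metric lies on a 0-100 scale."""
    return 1.0 - metric_score / 100.0


def enhanced_distance(z_theta: float, bleu: float, comet: float,
                      eta1: float, eta2: float, eta3: float) -> float:
    """z' = eta1 * z_theta + eta2 * z_BLEU + eta3 * z_COMET."""
    return (eta1 * z_theta
            + eta2 * metric_distance(bleu)
            + eta3 * metric_distance(comet))


# Placeholder inputs: z_theta = 0.5, BLEU = 20, COMET = 60, and illustrative etas.
z_prime = enhanced_distance(0.5, bleu=20.0, comet=60.0, eta1=1.0, eta2=0.5, eta3=0.5)
```

With these placeholder values, z' = 0.5 + 0.5 · 0.8 + 0.5 · 0.4 = 1.1.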

## 3 Experimental Setup

**Dataset.** We test on the Flores-200 (Costa-Jussà et al., 2022) dataset. For generic LLMs, we use a group of six Romance languages. For MT-adapted models (GemmaX2), we use a group of three Romance languages, following Cui et al. (2025). We select Chinese to showcase generalization beyond Romance languages.

**LLM backbones.** We consider several candidate LLMs for preference tuning: Llama3.1-8B (Grattafiori et al., 2024), Qwen2.5-7B (Qwen et al., 2025), Gemma2-9B (Team et al., 2024), and GemmaX2-9B (Cui et al., 2025). We also consider X-ALMA (Xu et al., 2025) as a reference baseline, which is based on Llama2 (Touvron et al., 2023).

**Preference optimization**
