Tag
This paper applies Group Relative Policy Optimization (GRPO) to encoder-decoder Seq2Seq models for machine translation fine-tuning, using reference-free rewards (LaBSE and COMET-Kiwi) that require no parallel data, and achieves consistent improvements across 13 languages.