Tag
This survey reframes the alignment tuning of large language models as a data pipeline design problem, decomposing it into three stages: response synthesis, preference evaluation, and preference instantiation. It identifies design trade-offs and failure modes, and outlines open challenges such as prompt-level alignment and agentic settings.
This paper introduces CroCo, a method for cross-lingual contrastive preference tuning on self-generated responses, showing that a reward model trained on English preferences can effectively rank responses in other languages, improving model performance across 14 languages without language-specific annotations.
Announces an upcoming video on training tiny models for preference tuning, covering reward models, RLHF, DPO, and ORPO, using Unsloth and TRL.