FSPO proposes a few-shot preference optimization algorithm for LLM personalization that reframes reward modeling as meta-learning, enabling models to infer a personalized reward function from only a handful of a user's preferences. With careful construction of the synthetic preference dataset, the method reaches 87% personalization performance on synthetic users and 70% on real users.
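A minimal sketch of the idea, assuming FSPO conditions the policy on a few labelled preference pairs from a user and then applies a DPO-style loss on a held-out pair for that user; the function names and prompt format below are hypothetical, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def build_fewshot_prompt(user_prefs, query):
    """Prepend the user's few labelled preference pairs to the new query (hypothetical format)."""
    shots = "\n".join(
        f"Prompt: {p}\nChosen: {c}\nRejected: {r}" for p, c, r in user_prefs
    )
    return f"{shots}\n\nPrompt: {query}\nResponse:"

def fspo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style objective on the held-out pair, conditioned on the few-shot context."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy usage with fake sequence log-probabilities standing in for policy / reference scores.
prompt = build_fewshot_prompt([("Pick a color", "blue", "red")], "Pick a fruit")
logp_c, logp_r = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-13.5])
print(fspo_loss(logp_c, logp_r, ref_c, ref_r))
```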
CLewR introduces a curriculum learning strategy with restarts for improving LLM machine translation through preference optimization. The method mitigates catastrophic forgetting by repeating the easy-to-hard curriculum multiple times, showing consistent gains across Gemma2, Qwen2.5, and Llama3.1 models.
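A minimal sketch of an easy-to-hard curriculum that is restarted several times, assuming difficulty is a per-example score and training proceeds bucket by bucket; the bucket counts and scheduling details are illustrative, not the paper's exact configuration.

```python
from typing import Callable, List, Tuple

def curriculum_with_restarts(
    examples: List[Tuple[str, float]],          # (example, difficulty score)
    train_step: Callable[[List[str]], None],
    num_buckets: int = 3,
    num_restarts: int = 2,
) -> None:
    # Sort once from easiest to hardest, then split into difficulty buckets.
    ordered = [x for x, _ in sorted(examples, key=lambda e: e[1])]
    bucket_size = (len(ordered) + num_buckets - 1) // num_buckets
    for _ in range(num_restarts):
        # Re-run the whole easy-to-hard pass so easy data is revisited,
        # which is the mechanism credited with reducing catastrophic forgetting.
        for b in range(num_buckets):
            bucket = ordered[b * bucket_size : (b + 1) * bucket_size]
            if bucket:
                train_step(bucket)

# Toy usage: difficulty is just a made-up score and "training" prints the batch.
data = [("easy pair", 1.0), ("medium pair", 2.0), ("hard pair", 3.0)]
curriculum_with_restarts(data, train_step=lambda batch: print("train on:", batch))
```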
CiPO is a novel framework for machine unlearning in Large Reasoning Models that uses iterative preference optimization with counterfactual reasoning traces to selectively remove unwanted knowledge while preserving reasoning abilities. The method addresses the challenge of unlearning in models that rely on chain-of-thought reasoning by generating logically valid alternative reasoning paths during training.
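A minimal sketch of one unlearning round, assuming each forget-set prompt is paired with a counterfactual reasoning trace (preferred) against the original knowledge-bearing trace (rejected) and passed to a preference-optimization step; all function names here are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # counterfactual trace that reasons validly without the unwanted fact
    rejected: str  # original chain-of-thought that exposes the unwanted knowledge

def cipo_round(
    forget_prompts: List[str],
    original_trace: Callable[[str], str],
    counterfactual_trace: Callable[[str], str],
    preference_update: Callable[[List[PreferencePair]], None],
) -> None:
    pairs = [
        PreferencePair(p, counterfactual_trace(p), original_trace(p))
        for p in forget_prompts
    ]
    preference_update(pairs)  # one preference-optimization step favoring the counterfactuals

# Toy usage with stub generators; a real run would iterate cipo_round until the
# model no longer reproduces the knowledge targeted for removal.
cipo_round(
    ["Who wrote secret document X?"],
    original_trace=lambda p: "Reasoning that reveals the fact...",
    counterfactual_trace=lambda p: "Valid reasoning that declines to reveal it...",
    preference_update=lambda pairs: print(f"updating on {len(pairs)} pair(s)"),
)
```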
GroupDPO introduces a memory-efficient algorithm for group-wise direct preference optimization that leverages multiple candidate responses per prompt while reducing peak memory usage through decoupled backpropagation. The method demonstrates consistent improvements over standard DPO across offline and online alignment settings.
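A minimal sketch of the memory-saving idea, assuming peak memory is reduced by scoring all candidates in a no-grad pass, taking the group-wise loss gradient at the scalar level, and then re-forwarding one candidate at a time; the toy model and softmax-style group loss are illustrative stand-ins, not the paper's exact objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)                       # toy stand-in for a policy log-prob head
group = [torch.randn(8) for _ in range(4)]    # K candidate responses for one prompt
chosen_idx = 0                                # index of the preferred response

def logp(x):
    """Scalar 'sequence log-prob' for one candidate under the toy model."""
    return model(x).squeeze()

# Pass 1: score every candidate without building autograd graphs (low memory).
with torch.no_grad():
    scores_detached = torch.stack([logp(x) for x in group])

# Group-wise loss on the detached scalars: softmax cross-entropy over the group.
scores_leaf = scores_detached.clone().requires_grad_(True)
loss = -torch.log_softmax(scores_leaf, dim=0)[chosen_idx]
loss.backward()                               # yields dL/ds_i for each candidate
grad_per_score = scores_leaf.grad

# Pass 2: re-forward one candidate at a time and inject its scalar gradient,
# so only one candidate's activations are ever live at once.
model.zero_grad()
for x, g in zip(group, grad_per_score):
    (logp(x) * g).backward()

print("loss:", loss.item(), "weight grad norm:", model.weight.grad.norm().item())
```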
WavAlign introduces a modality-aware adaptive post-training method that uses constrained preference updates and explicit anchoring to boost both semantic quality and speech expressiveness in end-to-end spoken dialogue models.
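A minimal sketch of what a constrained, modality-aware preference update could look like, assuming separate preference margins for text (semantic) and speech (expressiveness) tokens plus an explicit anchoring term toward the reference policy; the weights and formulation are assumptions, not WavAlign's published loss.

```python
import torch
import torch.nn.functional as F

def wavalign_style_loss(
    text_margin,        # beta * (policy - reference) log-ratio margin on text tokens
    speech_margin,      # the same margin computed on speech / audio tokens
    anchor_nll,         # NLL of the chosen response under the policy (anchoring term)
    w_text=1.0,
    w_speech=1.0,
    w_anchor=0.1,
):
    pref = -(w_text * F.logsigmoid(text_margin) + w_speech * F.logsigmoid(speech_margin))
    return (pref + w_anchor * anchor_nll).mean()

# Toy usage with made-up margins and anchor values.
loss = wavalign_style_loss(
    text_margin=torch.tensor([0.8]),
    speech_margin=torch.tensor([0.2]),
    anchor_nll=torch.tensor([3.5]),
)
print(loss)
```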