FSPO proposes a few-shot preference optimization algorithm for LLM personalization that reframes reward modeling as meta-learning, enabling models to infer a personalized reward function from only a handful of a user's preferences. With careful construction of the synthetic preference dataset, the method reaches 87% personalization performance on synthetic users and 70% on real users.
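A minimal sketch of the idea, assuming FSPO conditions the policy on a few labelled preference pairs from a user and then applies a DPO-style loss on a held-out pair for that user; the function names and prompt format below are hypothetical, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def build_fewshot_prompt(user_prefs, query):
    """Prepend the user's few labelled preference pairs to the new query (hypothetical format)."""
    shots = "\n".join(
        f"Prompt: {p}\nChosen: {c}\nRejected: {r}" for p, c, r in user_prefs
    )
    return f"{shots}\n\nPrompt: {query}\nResponse:"

def fspo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style objective on the held-out pair, conditioned on the few-shot context."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Toy usage with fake sequence log-probabilities standing in for policy / reference scores.
prompt = build_fewshot_prompt([("Pick a color", "blue", "red")], "Pick a fruit")
logp_c, logp_r = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_c, ref_r = torch.tensor([-13.0]), torch.tensor([-13.5])
print(fspo_loss(logp_c, logp_r, ref_c, ref_r))
```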
CLewR introduces a curriculum learning strategy with restarts for improving LLM machine translation through preference optimization. The method mitigates catastrophic forgetting by repeating the easy-to-hard curriculum multiple times, showing consistent gains across Gemma2, Qwen2.5, and Llama3.1 models.
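A minimal sketch of an easy-to-hard curriculum that is restarted several times, assuming difficulty is a per-example score and training proceeds bucket by bucket; the bucket counts and scheduling details are illustrative, not the paper's exact configuration.

```python
from typing import Callable, List, Tuple

def curriculum_with_restarts(
    examples: List[Tuple[str, float]],          # (example, difficulty score)
    train_step: Callable[[List[str]], None],
    num_buckets: int = 3,
    num_restarts: int = 2,
) -> None:
    # Sort once from easiest to hardest, then split into difficulty buckets.
    ordered = [x for x, _ in sorted(examples, key=lambda e: e[1])]
    bucket_size = (len(ordered) + num_buckets - 1) // num_buckets
    for _ in range(num_restarts):
        # Re-run the whole easy-to-hard pass so easy data is revisited,
        # which is the mechanism credited with reducing catastrophic forgetting.
        for b in range(num_buckets):
            bucket = ordered[b * bucket_size : (b + 1) * bucket_size]
            if bucket:
                train_step(bucket)

# Toy usage: difficulty is just a made-up score and "training" prints the batch.
data = [("easy pair", 1.0), ("medium pair", 2.0), ("hard pair", 3.0)]
curriculum_with_restarts(data, train_step=lambda batch: print("train on:", batch))
```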
CiPO is a novel framework for machine unlearning in Large Reasoning Models that uses iterative preference optimization with counterfactual reasoning traces to selectively remove unwanted knowledge while preserving reasoning abilities. The method addresses the challenge of unlearning in models that rely on chain-of-thought reasoning by generating logically valid alternative reasoning paths during training.
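A minimal sketch of one unlearning round, assuming each forget-set prompt is paired with a counterfactual reasoning trace (preferred) against the original knowledge-bearing trace (rejected) and passed to a preference-optimization step; all function names here are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # counterfactual trace that reasons validly without the unwanted fact
    rejected: str  # original chain-of-thought that exposes the unwanted knowledge

def cipo_round(
    forget_prompts: List[str],
    original_trace: Callable[[str], str],
    counterfactual_trace: Callable[[str], str],
    preference_update: Callable[[List[PreferencePair]], None],
) -> None:
    pairs = [
        PreferencePair(p, counterfactual_trace(p), original_trace(p))
        for p in forget_prompts
    ]
    preference_update(pairs)  # one preference-optimization step favoring the counterfactuals

# Toy usage with stub generators; a real run would iterate cipo_round until the
# model no longer reproduces the knowledge targeted for removal.
cipo_round(
    ["Who wrote secret document X?"],
    original_trace=lambda p: "Reasoning that reveals the fact...",
    counterfactual_trace=lambda p: "Valid reasoning that declines to reveal it...",
    preference_update=lambda pairs: print(f"updating on {len(pairs)} pair(s)"),
)
```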
GroupDPO introduces a memory-efficient algorithm for group-wise direct preference optimization that leverages multiple candidate responses per prompt while reducing peak memory usage through decoupled backpropagation. The method demonstrates consistent improvements over standard DPO across offline and online alignment settings.
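A minimal sketch of the memory-saving idea, assuming peak memory is reduced by scoring all candidates in a no-grad pass, taking the group-wise loss gradient at the scalar level, and then re-forwarding one candidate at a time; the toy model and softmax-style group loss are illustrative stand-ins, not the paper's exact objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)                       # toy stand-in for a policy log-prob head
group = [torch.randn(8) for _ in range(4)]    # K candidate responses for one prompt
chosen_idx = 0                                # index of the preferred response

def logp(x):
    """Scalar 'sequence log-prob' for one candidate under the toy model."""
    return model(x).squeeze()

# Pass 1: score every candidate without building autograd graphs (low memory).
with torch.no_grad():
    scores_detached = torch.stack([logp(x) for x in group])

# Group-wise loss on the detached scalars: softmax cross-entropy over the group.
scores_leaf = scores_detached.clone().requires_grad_(True)
loss = -torch.log_softmax(scores_leaf, dim=0)[chosen_idx]
loss.backward()                               # yields dL/ds_i for each candidate
grad_per_score = scores_leaf.grad

# Pass 2: re-forward one candidate at a time and inject its scalar gradient,
# so only one candidate's activations are ever live at once.
model.zero_grad()
for x, g in zip(group, grad_per_score):
    (logp(x) * g).backward()

print("loss:", loss.item(), "weight grad norm:", model.weight.grad.norm().item())
```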
WavAlign introduces a modality-aware adaptive post-training method that uses constrained preference updates and explicit anchoring to boost both semantic quality and speech expressiveness in end-to-end spoken dialogue models.
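A minimal sketch of what a constrained, modality-aware preference update could look like, assuming separate preference margins for text (semantic) and speech (expressiveness) tokens plus an explicit anchoring term toward the reference policy; the weights and formulation are assumptions, not WavAlign's published loss.

```python
import torch
import torch.nn.functional as F

def wavalign_style_loss(
    text_margin,        # beta * (policy - reference) log-ratio margin on text tokens
    speech_margin,      # the same margin computed on speech / audio tokens
    anchor_nll,         # NLL of the chosen response under the policy (anchoring term)
    w_text=1.0,
    w_speech=1.0,
    w_anchor=0.1,
):
    pref = -(w_text * F.logsigmoid(text_margin) + w_speech * F.logsigmoid(speech_margin))
    return (pref + w_anchor * anchor_nll).mean()

# Toy usage with made-up margins and anchor values.
loss = wavalign_style_loss(
    text_margin=torch.tensor([0.8]),
    speech_margin=torch.tensor([0.2]),
    anchor_nll=torch.tensor([3.5]),
)
print(loss)
```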