Tag
This paper introduces CLIPR, a framework that learns transferable latent user preferences from minimal conversational input to improve human-aligned decision making in LLMs.
This paper introduces xi-DPO, a novel preference optimization method that reformulates the objective to minimize distance to optimal ratio reward margins, addressing hyperparameter tuning challenges in SimPO. Experimental results show that xi-DPO outperforms existing methods on open benchmarks.
WildFeedback is a novel framework that leverages in-situ user feedback from actual LLM conversations to automatically create preference datasets for aligning language models with human preferences, addressing scalability and bias issues in traditional annotation-based alignment methods.