This paper analyzes how preference optimization methods such as DPO learn spurious correlations, identifying mechanisms including mean spurious bias and causal-spurious leakage. It proposes 'tie training', which uses equal-utility preference pairs as a mitigation strategy to reduce reliance on spurious features without degrading causal learning.
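The paper's exact tie-training objective is not given in this summary; as a rough illustration only, a standard DPO loss with a hypothetical tie-handling variant (an assumed soft 0.5 preference label for equal-utility pairs, which pushes the reward margin toward zero) might look like:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, tie=False):
    """Standard DPO loss on one preference pair.

    logp_* are policy log-probs of the chosen (w) and rejected (l) responses;
    ref_logp_* are the reference model's log-probs. With tie=True, a
    *hypothetical* tie-training variant is used: the pair is treated as
    equal-utility via a soft 0.5 label (an assumption, not the paper's
    stated formulation), so the loss is minimized when the margin is zero.
    """
    # implicit reward margin, beta * (log-ratio of chosen - log-ratio of rejected)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    p = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin
    if tie:
        # cross-entropy against a 0.5 target: discourages any preference
        # for either response, so spurious features shared by both gain nothing
        return -0.5 * math.log(p) - 0.5 * math.log(1.0 - p)
    # usual DPO: maximize probability that the chosen response is preferred
    return -math.log(p)
```

Under this sketch, an ordinary pair is driven toward a large positive margin, while a tie pair is driven toward a zero margin, which is one plausible way equal-utility pairs could suppress spurious-feature rewards.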
This paper introduces xi-DPO, a preference optimization method that reformulates the objective as minimizing the distance to the optimal-ratio reward margin, addressing the hyperparameter-tuning challenges of SimPO. Experiments show that xi-DPO outperforms existing methods on open benchmarks.
This paper introduces CASPO, a framework for aligning token-level confidence with step-wise logical correctness in large reasoning models using iterative Direct Preference Optimization. It also proposes Confidence-aware Thought (CaT) for dynamically pruning uncertain reasoning branches during inference to improve reliability and efficiency.