Tag
This survey synthesizes research on toxicity detection and detoxification for multilingual large language models, cataloging threat models, task formulations, detection approaches, and mitigation strategies, while identifying persistent challenges such as uneven language coverage and culturally contingent definitions of harm.
This paper analyzes how preference optimization methods such as Direct Preference Optimization (DPO) learn spurious correlations, identifying mechanisms such as mean spurious bias and causal-spurious leakage. As a mitigation, it proposes 'tie training', which adds equal-utility preference pairs to reduce reliance on spurious features without degrading causal learning.
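
To make the idea of an equal-utility pair concrete, the following is a minimal sketch, not the paper's formulation. It assumes that a tie is encoded as a soft label of 0.5 in the standard DPO logistic loss, so that the loss is smallest when the policy expresses no preference between the two responses. The function names, the `label` parameter, and the default `beta` are all hypothetical and chosen for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_margin(policy_a: float, policy_b: float,
               ref_a: float, ref_b: float, beta: float = 0.1) -> float:
    """Scaled implicit-reward gap between responses a and b.

    Inputs are sequence log-probabilities under the trained policy and
    the frozen reference model; beta is the usual DPO temperature.
    """
    return beta * ((policy_a - ref_a) - (policy_b - ref_b))


def preference_loss(margin: float, label: float = 1.0) -> float:
    """Logistic preference loss with a soft label.

    label=1.0 recovers the standard DPO loss (a preferred over b).
    label=0.5 is the ASSUMED encoding of a tie: the loss is symmetric
    in the margin and minimized at margin == 0.
    """
    return -(label * log_sigmoid(margin)
             + (1.0 - label) * log_sigmoid(-margin))


# Ordinary pair: the policy already favors the preferred response a.
m = dpo_margin(policy_a=-10.0, policy_b=-14.0, ref_a=-12.0, ref_b=-12.0)
strict_loss = preference_loss(m, label=1.0)

# Tie pair with the same margin: any nonzero margin is penalized.
tie_loss = preference_loss(m, label=0.5)
```

Under this assumed encoding, a tie pair whose two responses differ only in a spurious feature pushes the margin toward zero, which is one way such pairs could discourage the model from rewarding that feature.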