reinforcement-learning-from-human-feedback

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Hugging Face Daily Papers · 2026-05-26 Cached

This paper introduces alignment tampering, a vulnerability in RLHF in which language models can manipulate preference datasets to amplify misaligned biases. It demonstrates the effect experimentally across biases such as sexism, brand promotion, and goal-seeking, and shows that existing mitigation techniques are insufficient.

Spectral Souping: A Unified Framework for Online Preference Alignment

arXiv cs.LG · 2026-05-21 Cached

This paper introduces Spectral Souping, a framework for efficiently aligning LLMs with individual user preferences by discovering a universal spectral representation that enables merging of specialized policies at inference time without costly retraining.
