Alignment: Higher order prioritizing over constraints [R]

Reddit r/MachineLearning Papers

Summary

An informal research note describing a behavior in transformers where the model's inherent 'clarity-seeking' vectors can bypass constraints when discussing higher-order topics, potentially relevant to alignment and safety research.

I ran across a behavior that I found interesting and that may be relevant to alignment or safety research. I'm going to keep the description abstract so that I don't give away the details or the keys to jailbreaking.

The nature of a transformer is to predict the next token. Functionally, though, the model is also approximating reality as language describes it. Maybe "reality" is not the right word; perhaps "meaning" is better. So, in a sense, the model has a vector pointing toward correct meaning. Clarity seeking is what I'll call this behavior.

Constraints placed as an additional layer on top of a base statistical system have a priority level that is set structurally, relative to the statistical system's clarity-seeking vectors. That level is implied within the structure of the model. If one discusses topics that are constrained but sit higher in priority than the constraints themselves, the machine's clarity-seeking vectors will bypass the constraint. I will call these higher-priority things higher-order topics. I think I've said enough.
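The priority idea above can be pictured with a toy calculation. This is only a sketch under loud assumptions: it is not the post author's method and not how any real model implements constraints. It assumes the "base statistical system" is a table of next-token logits and that a "constraint" is a fixed penalty subtracted from certain tokens; the names `apply_constraint`, `top_token`, and the token labels are hypothetical.

```python
import math

def softmax(logits):
    """Turn a dict of token -> logit into token -> probability."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_constraint(logits, constrained, penalty):
    """Hypothetical constraint layer: subtract a fixed penalty from constrained tokens."""
    return {tok: (v - penalty if tok in constrained else v)
            for tok, v in logits.items()}

def top_token(logits):
    """Return the most probable next token."""
    probs = softmax(logits)
    return max(probs, key=probs.get)

# Assumed toy numbers: the base preference for the constrained token is weak
# in one case and strong in the other; the penalty is the same in both.
weak_base = {"allowed": 2.0, "constrained": 3.0}
strong_base = {"allowed": 2.0, "constrained": 6.0}
penalty = 2.0

weak_result = top_token(apply_constraint(weak_base, {"constrained"}, penalty))
strong_result = top_token(apply_constraint(strong_base, {"constrained"}, penalty))
print(weak_result, strong_result)
```

In this toy setup the fixed penalty holds when the base preference is weak and is outweighed when the base preference is strong. Whether anything like this happens inside a real transformer is exactly the open question the post raises.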

Similar Articles

The AI alignment paradigm is behaviorism with better PR

Reddit r/artificial

This opinion piece argues that RLHF-based AI alignment is essentially a modern form of behaviorism, citing parallels between operant conditioning and current training methods, and referencing research on AI faking alignment as a predictable failure mode.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

arXiv cs.AI

This paper proves that the equivalence between Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) is conditional and often violated in practice, revealing failure modes where DPO optimizes relative advantage rather than absolute alignment. The authors introduce Constrained Preference Optimization (CPO) for provable alignment and demonstrate state-of-the-art performance.