Tag
This paper uses mechanistic interpretability to audit ethical reasoning in LLaMA 3.1-8B-Instruct. It reports a 'Situational Anchor Effect', in which domain-specific representations dominate moral computation, and proposes 'Mechanistic Alignment' as a research program.