How Well Do Models Follow Their Constitutions?

arXiv cs.AI · 2026-05-26

This paper proposes a multi-method audit pipeline to evaluate how well frontier AI models follow their written behavioral specifications (Anthropic's constitution and OpenAI's Model Spec) under adversarial multi-turn pressure, finding that newer models show significantly lower violation rates (e.g., Claude Sonnet 4.6 at 2.0% vs. Sonnet 4 at 15.0%).
