Tag
This paper proposes a multi-method audit pipeline for evaluating how well frontier AI models follow their written behavioral specifications (Anthropic's constitution and OpenAI's Model Spec) under adversarial multi-turn pressure. It finds that newer models show significantly lower violation rates (e.g., Claude Sonnet 4.6 at 2.0% vs. Sonnet 4 at 15.0%).