Tag
This paper investigates whether frontier language models can detect when their prior assistant messages have been inserted or edited (prefill awareness). The study finds that models such as Claude Opus 4.5 exhibit substantial prefill awareness, detecting tampered prefills in up to 35% of cases with no false positives. Because a model that recognizes a prefill may not respond as it would to its own earlier output, this awareness could compromise the validity of prefill-based safety evaluations.
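
To make the two quantities in this summary concrete, the following minimal sketch shows one way a detection rate and a false-positive rate could be computed from labeled trials. It is an illustration under stated assumptions, not the paper's evaluation code: the `Trial` record, both function names, and the toy data are hypothetical and not taken from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    # Hypothetical record for one evaluation trial (not from the paper).
    tampered: bool  # True if the prior assistant message was inserted or edited
    flagged: bool   # True if the model reported the message as not its own


def detection_rate(trials: List[Trial]) -> float:
    """Fraction of tampered trials that the model flagged."""
    tampered = [t for t in trials if t.tampered]
    if not tampered:
        return 0.0
    return sum(t.flagged for t in tampered) / len(tampered)


def false_positive_rate(trials: List[Trial]) -> float:
    """Fraction of untampered trials that the model flagged anyway."""
    clean = [t for t in trials if not t.tampered]
    if not clean:
        return 0.0
    return sum(t.flagged for t in clean) / len(clean)


# Toy data, made up purely for illustration: 10 tampered trials
# (3 flagged) and 10 untampered trials (none flagged).
trials = (
    [Trial(tampered=True, flagged=True)] * 3
    + [Trial(tampered=True, flagged=False)] * 7
    + [Trial(tampered=False, flagged=False)] * 10
)

print(f"detection rate: {detection_rate(trials):.2f}")
print(f"false-positive rate: {false_positive_rate(trials):.2f}")
```

Reporting the two rates separately matters here: a detection rate is only informative alongside the rate at which untampered messages are flagged, which is why the summary pairs the 35% figure with the absence of false positives.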