Tag
This paper proposes Safety Reflection Pretraining, a method that integrates regular safety reflections into pretraining corpora to embed self-monitoring directly into language modeling. It reports improved safety alignment and reduced attack success rates in 1.7B-parameter models.