safety-reflection

Tag

Cards List
#safety-reflection

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

arXiv cs.AI · 6d ago Cached

This paper proposes Safety Reflection Pretraining, a method that integrates regular safety reflections into pretraining corpora to embed self-monitoring directly into language modeling, showing improved safety alignment and reduced attack success rates in 1.7B models.

0 favorites 0 likes
← Back to home

Submit Feedback