configurable-safety


Configurable Reward Model for Balanced Safety Alignment

arXiv cs.CL · 2026-06-01

This paper introduces the Configurable Safety Reward Model (CSRM), which can be configured to accommodate heterogeneous and evolving safety requirements in LLM alignment. CSRM achieves state-of-the-art results on configurable safety benchmarks and improves the helpfulness-safety tradeoff.
