configurable-safety

Configurable Reward Model for Balanced Safety Alignment

arXiv cs.CL · 2026-06-01

This paper introduces the Configurable Safety Reward Model (CSRM), a reward model that can be configured to accommodate heterogeneous and evolving safety requirements for LLM alignment. CSRM achieves state-of-the-art results on configurable safety benchmarks and improves the helpfulness-safety tradeoff.
