RUBAS is a rubric-based reinforcement learning framework for agent safety that decomposes LLM agent behavior into four dimensions—tool-use safety, argument safety, response safety, and helpfulness—providing fine-grained rewards over complete trajectories. Experiments show RUBAS improves safety over standard alignment baselines while reducing tool-grounded hallucinations and maintaining competitive utility.
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.