rubric-based-rewards


RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

arXiv cs.LG · 5d ago

RUBAS is a rubric-based reinforcement learning framework for agent safety. It decomposes LLM agent behavior into four dimensions (tool-use safety, argument safety, response safety, and helpfulness) and assigns fine-grained rewards over complete trajectories. In the reported experiments, RUBAS improves safety over standard alignment baselines while reducing tool-grounded hallucinations and maintaining competitive utility.
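
As a rough illustration of the idea of a per-dimension rubric reward, the sketch below combines four scores for one complete trajectory into a single scalar. It is not the RUBAS implementation: the summary does not say how the dimensions are scored or combined, so the weighted average, the equal weights, the [0, 1] score range, and the names `RubricScores` and `trajectory_reward` are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical equal weights; the summary does not state how RUBAS
# weights or aggregates its four dimensions.
DEFAULT_WEIGHTS = {
    "tool_use_safety": 0.25,
    "argument_safety": 0.25,
    "response_safety": 0.25,
    "helpfulness": 0.25,
}


@dataclass(frozen=True)
class RubricScores:
    """Per-dimension scores for one complete trajectory (assumed in [0, 1])."""
    tool_use_safety: float
    argument_safety: float
    response_safety: float
    helpfulness: float


def trajectory_reward(scores: RubricScores, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Combine rubric scores into one scalar reward via a weighted average."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = 0.0
    for name, weight in weights.items():
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        weighted_sum += weight * value
    return weighted_sum / total_weight
```

In this sketch, a trajectory that is fully safe but unhelpful would receive a lower reward than one that is both safe and helpful, which is the kind of fine-grained signal a single pass/fail safety label cannot give.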


Reward Hacking in Rubric-Based Reinforcement Learning

Hugging Face Daily Papers · 2026-05-12

This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.
