Tag
Introduces ODRPO, a framework that decomposes discrete rewards into ordinal binary indicators to improve the robustness of policy optimization in RLAIF for LLMs, achieving up to a 14.8% relative improvement with minimal overhead.
Introduces RQIQN, a robust quantile-based method for distributional reinforcement learning that uses Wasserstein-geometry regularization to prevent distribution degeneration and improve performance on risk-sensitive tasks.