rlaif


ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

arXiv cs.LG · yesterday

Introduces ODRPO, a framework that decomposes discrete rewards into ordinal binary indicators to make policy optimization more robust in RLAIF for LLMs. The paper reports up to 14.8% relative improvement with minimal overhead.
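
To make the phrase "ordinal binary indicators" concrete, here is a minimal sketch of one standard ordinal encoding: a discrete reward r in {0, ..., K-1} is mapped to the K-1 threshold indicators 1[r >= k]. This is an assumption for illustration only. The summary does not state the exact decomposition ODRPO uses, and the function names `ordinal_indicators` and `reconstruct` are hypothetical.

```python
def ordinal_indicators(reward: int, num_levels: int) -> list[int]:
    """Encode a discrete reward as threshold indicators 1[reward >= k].

    Assumption: rewards are integers in {0, ..., num_levels - 1}.
    This is a generic ordinal encoding, not necessarily the one used in ODRPO.
    """
    if not 0 <= reward < num_levels:
        raise ValueError(f"reward must be in [0, {num_levels - 1}], got {reward}")
    # One binary indicator per threshold k = 1, ..., num_levels - 1.
    return [1 if reward >= k else 0 for k in range(1, num_levels)]


def reconstruct(indicators: list[int]) -> int:
    """Recover the original reward: it equals the number of thresholds met."""
    return sum(indicators)


if __name__ == "__main__":
    bits = ordinal_indicators(3, num_levels=5)
    print(bits, reconstruct(bits))
```

Under this encoding each indicator is a binary signal, and summing the indicators recovers the original reward, so no information is lost by the decomposition.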
