step-level-feedback

#step-level-feedback

Process Rewards with Learned Reliability

arXiv cs.CL ↗ · 2026-05-18 Cached

BetaPRM is a process reward model that predicts both a step-level success probability and the reliability of that prediction using a Beta belief from Monte Carlo continuations, enabling adaptive computation allocation that reduces token usage by up to 33.57% while improving accuracy.

0 favorites 0 likes

step-level-feedback

Process Rewards with Learned Reliability

Submit Feedback