Tag
This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased, variance-reduced estimates of ranking evaluation metrics such as Precision@K. The method is validated on the ESCI benchmark and in a production A/B test. In the A/B test it correctly identified the best system variant using only 100 human labels, a result confirmed by a +407 bps sales improvement.
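To make the idea of combining a few human labels with many LLM judgments concrete, here is a minimal sketch of the standard Prediction-Powered Inference estimator for a mean (for example, the fraction of top-K results that are relevant). It is not the PRECISE estimator itself, whose details the summary above does not give. The function name `ppi_mean`, its parameters, and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def ppi_mean(human_labels, llm_on_labeled, llm_on_unlabeled, z=1.96):
    """Classical PPI estimate of a mean with a normal-approximation interval.

    human_labels     : human judgments on the small labeled set (assumed 0/1)
    llm_on_labeled   : LLM judgments on the same labeled items
    llm_on_unlabeled : LLM judgments on the large unlabeled set
    z                : normal quantile (1.96 gives roughly a 95% interval)
    """
    y = np.asarray(human_labels, dtype=float)
    f_lab = np.asarray(llm_on_labeled, dtype=float)
    f_unl = np.asarray(llm_on_unlabeled, dtype=float)
    if y.shape != f_lab.shape:
        raise ValueError("human and LLM judgments must cover the same items")

    n, N = y.size, f_unl.size

    # Rectifier: how far the LLM judge is from the humans on labeled items.
    rectifier = y - f_lab

    # LLM-only mean on the large set, corrected by the average rectifier.
    estimate = f_unl.mean() + rectifier.mean()

    # The two samples are independent, so their variances add.
    variance = f_unl.var(ddof=1) / N + rectifier.var(ddof=1) / n
    half_width = z * np.sqrt(variance)
    return estimate, (estimate - half_width, estimate + half_width)


# Toy data (made up): the LLM judge is too generous on one labeled item.
human = [1, 0, 1, 1]
llm_labeled = [1, 1, 1, 1]
llm_unlabeled = [1, 1, 0, 1, 1, 1, 0, 1]

estimate, (low, high) = ppi_mean(human, llm_labeled, llm_unlabeled)
print(estimate, low, high)
```

On this toy input the LLM-only mean is 0.75 and the rectifier mean is -0.25, so the corrected estimate is 0.5. The correction removes the judge's bias. The interval narrows as the unlabeled set grows and as the LLM agrees more closely with the human labels.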
GLIDE is an open-source Python library that unifies state-of-the-art Prediction-Powered Inference methods for debiased evaluation of generative AI and agentic systems, enabling annotation savings with valid uncertainty estimates.