ranking-metrics

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

arXiv cs.LG · 3d ago

This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased, variance-reduced estimates of ranking evaluation metrics such as Precision@K. The method is validated on the ESCI benchmark and in a production A/B test. In the A/B test it correctly identified the best system variant using only 100 human labels, a result confirmed by a +407 bps sales improvement.
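
To make the idea concrete, here is a minimal sketch of the basic Prediction-Powered Inference estimator of a mean, which PRECISE is described as extending. It is not the paper's method: the function name `ppi_mean`, its parameters, and the assumption that each input is a list of per-query Precision@K scores are all hypothetical, chosen only for illustration. The LLM-only average over the large set is corrected by the average human-minus-LLM gap measured on the small labeled set.

```python
import math
from statistics import mean, variance


def ppi_mean(human_labeled, llm_on_labeled, llm_on_unlabeled, z=1.96):
    """Basic PPI point estimate and normal-approximation interval for a mean.

    human_labeled:    per-query metric from human labels (small set, size n)
    llm_on_labeled:   per-query metric from LLM judgments on the same n queries
    llm_on_unlabeled: per-query metric from LLM judgments on N other queries
    All three are hypothetical inputs; each set needs at least two queries.
    """
    n = len(human_labeled)
    N = len(llm_on_unlabeled)
    # Rectifier: how far the LLM-based score is from the human-based score.
    rectifier = [y - f for y, f in zip(human_labeled, llm_on_labeled)]
    # LLM-only average plus the bias correction measured on the labeled set.
    estimate = mean(llm_on_unlabeled) + mean(rectifier)
    # Standard error combines the variance of both independent samples.
    se = math.sqrt(variance(llm_on_unlabeled) / N + variance(rectifier) / n)
    return estimate, (estimate - z * se, estimate + z * se)
```

The correction term keeps the estimate unbiased even when the LLM judge is systematically off, and the interval tightens when the LLM scores track the human scores closely, because the rectifier variance is then small.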

