Tag
This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased, variance-reduced estimates of ranking evaluation metrics such as Precision@K. The method is validated on the ESCI benchmark and in a production A/B test, where it correctly identified the best system variant using only 100 human labels; that choice was confirmed by a +407 bps sales improvement.
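To make the idea of combining few human labels with many LLM judgments concrete, here is a minimal sketch of the basic Prediction-Powered Inference mean estimator that PRECISE is described as extending. It is not the paper's PRECISE procedure. The function name `ppi_mean` and the toy values are hypothetical, and the inputs are assumed to be per-query Precision@K scores computed once from human labels and once from LLM judgments.

```python
from statistics import mean


def ppi_mean(human_labeled, llm_labeled, llm_unlabeled):
    """Basic prediction-powered estimate of a mean (illustrative sketch).

    human_labeled[i] and llm_labeled[i] must score the same query:
    one from human labels, one from LLM judgments.
    llm_unlabeled holds LLM-based scores for queries with no human labels.
    """
    if len(human_labeled) != len(llm_labeled):
        raise ValueError("labeled human and LLM scores must be paired")
    if not human_labeled or not llm_unlabeled:
        raise ValueError("both the labeled and unlabeled sets must be non-empty")

    # Rectifier: average human-minus-LLM gap, measured on the small labeled set.
    rectifier = mean(h - f for h, f in zip(human_labeled, llm_labeled))

    # LLM-only average on the large set, corrected by the rectifier.
    return mean(llm_unlabeled) + rectifier


# Toy data (made up for illustration): the LLM overrates one labeled query.
human = [1, 0, 1, 1]
llm_on_labeled = [1, 1, 1, 1]
llm_on_unlabeled = [1, 1, 0, 1, 1, 1, 0, 1]
estimate = ppi_mean(human, llm_on_labeled, llm_on_unlabeled)
```

The rectifier term is what removes the LLM's systematic bias from the estimate; when the LLM agrees with the humans on every labeled query, the estimate reduces to the plain LLM average over the large set.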
This paper introduces TADDLE, a tool-augmented agent for detecting deficient LLM-generated peer reviews, along with an expert-annotated benchmark of 1,800 reviews of 50 ICLR 2025 papers. The system decomposes detection into four specialized analysis tools and uses two-stage semi-supervised learning for binary and multi-label classification.