human-annotation

#human-annotation

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

arXiv cs.LG · 3d ago

This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased and variance-reduced estimates of ranking evaluation metrics like Precision@K. The method is validated on the ESCI benchmark and in a production A/B test, where it correctly identified the best system variant using only 100 human labels, confirmed by a +407 bps sales improvement.
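
As a rough illustration of the underlying idea, the sketch below implements the basic Prediction-Powered Inference estimate of a mean, not the PRECISE estimator itself. It assumes binary relevance labels, so the mean relevance of the scored items stands in for Precision@K. The function name `ppi_mean` and the toy data are hypothetical.

```python
from statistics import mean


def ppi_mean(human_labels, llm_on_labeled, llm_on_unlabeled):
    """Basic prediction-powered estimate of a mean (illustrative only).

    human_labels:     human relevance labels (0/1) on the small labeled set
    llm_on_labeled:   LLM judgments (0/1) for those same items
    llm_on_unlabeled: LLM judgments (0/1) on the large unlabeled set
    """
    if len(human_labels) != len(llm_on_labeled):
        raise ValueError("labeled sets must be aligned item by item")
    # Rectifier: average gap between human and LLM on the labeled items.
    # It corrects the systematic bias of the LLM judge.
    rectifier = mean(h - l for h, l in zip(human_labels, llm_on_labeled))
    # LLM-only estimate on the large set, shifted by the rectifier.
    return mean(llm_on_unlabeled) + rectifier


# Toy data (made up): the LLM judge is too generous on one labeled item.
human = [1, 0, 1, 1]
llm_labeled = [1, 1, 1, 1]
llm_unlabeled = [1, 1, 0, 1, 1, 1, 0, 1]
estimate = ppi_mean(human, llm_labeled, llm_unlabeled)
```

Here the LLM-only estimate on the unlabeled set is 0.75, and the rectifier of -0.25 pulls it down to 0.5. The large LLM-judged set is what reduces the variance, while the small human-labeled set removes the judge's bias.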

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

arXiv cs.AI · 4d ago

This paper empirically examines when to interrupt autonomous AI agents during software execution, finding that affective-state thresholds saturate quickly, LLM judges achieve low F1 scores (0.17–0.40) at high cost, and human annotators themselves show near-chance agreement on intervention timing, making the construct unreliable as an optimization target.

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Hugging Face Daily Papers · 2026-06-01

This paper presents a large-scale audit of human annotation reporting in NLP from 2018 to 2025. It finds that critical details are documented inconsistently, although reporting has improved over time, and it provides a framework and recommendations for better reporting.