Tag
This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings, and provides guidance on choosing sample sizes for human and LLM reviews using a doubly robust estimator from the missing-data literature.
A human review of TranslateGemma-12b's translations revealed that 71% of segments rated clean by automated metrics actually contained errors, highlighting significant gaps in metric-only evaluation for multilingual translation quality.
OpenAI researchers found that optimizing language models purely for correct answers makes their outputs harder for humans to interpret. They propose 'prover-verifier games', in which a prover generates solutions and a verifier checks them, improving legibility for both humans and AI systems.