human-evaluation

#human-evaluation

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv cs.LG · 2026-05-19

This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings. Using a doubly robust estimator from the missing-data literature, it gives guidance on how many human and LLM reviews to collect.


Follow-up to my TranslateGemma-12b benchmark post: human reviewers flagged 71% of the segments automated metrics rated clean

Reddit r/LocalLLaMA · 2026-05-12

Human reviewers of TranslateGemma-12b's translations flagged 71% of the segments that automated metrics had rated clean, which points to significant gaps in metric-only evaluation of multilingual translation quality.


Prover-Verifier Games improve legibility of language model outputs

OpenAI Blog · 2024-07-17

OpenAI researchers found that optimizing language models purely for correct answers makes their outputs harder for humans to interpret. They propose 'prover-verifier games', in which a prover generates solutions and a verifier checks them, to improve legibility for both humans and AI systems.
