human-evaluation

#human-evaluation

Introducing Real World VoiceEQ: Measuring the human quality of voice AI

Hugging Face Blog ↗ · 5d ago Cached

Real World VoiceEQ is a new benchmark for evaluating the human quality of voice AI, based on over a million human ratings, assessing models across speech recognition, synthesis, and understanding in real-world conditions.

AI translation of literary texts is "fine", but readers still prefer human translations

Hugging Face Daily Papers ↗ · 2026-06-24 Cached

A study comparing human and AI translations of literary works shows that while machine translations are deemed 'fine', readers still prefer human translations for their immersiveness and clarity. Automatic metrics fail to capture reader preferences.

Human Evaluation of GLM-5.2

Reddit r/LocalLLaMA ↗ · 2026-06-23

The author praises GLM-5.2, an MIT-licensed open-weights model, for its exceptional real-world performance in human evaluation, claiming it rivals the best closed-source models, such as Claude.

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

arXiv cs.AI ↗ · 2026-06-11 Cached

This exploratory study evaluates whether augmenting AI agents with a medical research skill package improves the quality of transcriptomic research analysis outputs compared to native AI, using a multi-model human evaluation in an NSCLC biomarker task. Results show a directional but statistically non-significant improvement, highlighting the need for larger, more robust evaluations.

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

Hugging Face Daily Papers ↗ · 2026-06-10 Cached

This paper introduces RQ-Bench, a benchmark to evaluate LLMs' ability to assess the novelty of scientific research questions. It finds that LLM judges consistently rate generated questions as more novel than human experts do, raising concerns about the reliability of using LLMs for scientific novelty evaluation.

Re-Centering Humans in LLM Personalization

Hugging Face Daily Papers ↗ · 2026-06-04 Cached

This paper investigates the effectiveness of LLM personalization by putting real humans back into the evaluation loop, revealing systematic gaps between human judgments and LLM outputs at every stage of the personalization pipeline, and highlighting the limitations of synthetic data and LLM judges.

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv cs.LG ↗ · 2026-05-19 Cached

This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings, and provides guidance on choosing sample sizes for human and LLM reviews using a doubly robust estimator from the missing-data literature.

Follow-up to my TranslateGemma-12b benchmark post: human reviewers flagged 71% of the segments automated metrics rated clean

Reddit r/LocalLLaMA ↗ · 2026-05-12

A human review of TranslateGemma-12b's translations revealed that 71% of segments rated clean by automated metrics actually contained errors, highlighting significant gaps in metric-only evaluation for multilingual translation quality.

Prover-Verifier Games improve legibility of language model outputs

OpenAI Blog ↗ · 2024-07-17 Cached

OpenAI researchers found that optimizing language models purely for correct answers reduces human interpretability, and propose 'prover-verifier games' where a prover generates solutions and a verifier checks them, improving legibility for both humans and AI systems.
