evaluation-metrics

#evaluation-metrics

Offline Preference-Based Trajectory Evaluation

arXiv cs.LG · 2026-06-17

This paper proposes offline preference-based trajectory evaluation for agentic systems, which compares trajectories via temporal preferences rather than binary success metrics. It shows that this approach reduces ties from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.

#evaluation-metrics

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

arXiv cs.LG · 2026-05-25

This paper demonstrates that pointwise metrics like RMSE and MAE structurally mislead for inverse problems with multimodal posteriors, because optimal point estimators collapse the posterior and distort spectral features. It proposes a three-part evaluation protocol using per-event distributional accuracy, spectrum-fidelity diagnostics, and coverage-based calibration to address these failures.

#evaluation-metrics

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

arXiv cs.CL · 2026-05-21

This paper proposes evaluating speech articulation synthesis using phoneme recognition with articulatory features, addressing limitations of traditional metrics like point-wise distance. Experiments on a single-speaker RT-MRI dataset show the approach captures phonetic nuances and improves assessment.

#evaluation-metrics

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Hugging Face Daily Papers · 2026-05-21

This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) to evaluate trustworthiness over traditional multimodal gain.

#evaluation-metrics

How Should We Determine Whether an AI Agent's Recommendation Is Truly Quality-Driven?

Reddit r/AI_Agents · 2026-05-15

This post discusses the inadequacy of traditional metrics such as accuracy and click-through rate for evaluating AI agent recommendations, and proposes a more holistic, long-term evaluation that considers user understanding, trade-offs, and real-world problem-solving.

#evaluation-metrics

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

arXiv cs.CL · 2026-05-15

This paper proposes two new metrics—Knowledge Separability Score (KSS) and Knowledge Persistence Score (KPS)—to evaluate cross-linguistic information removal in multilingual machine unlearning for LLMs, addressing shortcomings of prior per-language evaluation protocols.

#evaluation-metrics

AA introduces Coding Agent Index - Performance Comparisons between Model & Harness Combinations

Reddit r/singularity · 2026-05-11

Artificial Analysis introduces the Coding Agent Index, a new benchmark suite combining SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA to evaluate the performance of AI coding agents across diverse tasks.

#evaluation-metrics

MIND: Monge Inception Distance for Generative Models Evaluation

arXiv cs.LG · 2026-05-11

This paper introduces MIND (Monge Inception Distance), a new metric for evaluating generative models that is more sample-efficient, faster, and more robust than the standard Fréchet Inception Distance (FID).

#evaluation-metrics

Beyond Single Ground Truth: Reference Monism as Epistemic Injustice in ASR Evaluation

arXiv cs.CL · 2026-05-11

This paper critiques the use of single-reference ground truth in ASR evaluation, arguing it causes epistemic injustice for speakers with aphasia. It proposes a new metric, Epistemic Injustice Distance, and advocates for WER-Range to account for diverse transcription conventions.

#evaluation-metrics

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Hugging Face Daily Papers · 2026-05-10

This paper introduces DeltaRubric, a two-step multimodal preference evaluation approach that uses a single MLLM to improve reward modeling reliability through joint planning and verification.

#evaluation-metrics

Representation Fréchet Loss for Visual Generation

Papers with Code Trending · 2026-04-30

This paper introduces FD-loss, a method to optimize Fréchet Distance as a training objective for visual generation by decoupling population and batch sizes. It demonstrates that this approach improves generator quality and suggests FID may not always accurately reflect visual quality.

#evaluation-metrics

Lessons learned on language model safety and misuse

OpenAI Blog · 2022-03-03

OpenAI shares lessons learned on language model safety and misuse, discussing challenges in measuring risks, the limitations of existing benchmarks, and their development of new evaluation metrics for toxicity and policy violations. The post also highlights concerns about labor market impacts and the need for continued research on measuring social effects of AI deployment at scale.

#evaluation-metrics

Testing robustness against unforeseen adversaries

OpenAI Blog · 2019-08-22

OpenAI researchers developed a method to evaluate neural network robustness against unforeseen adversarial attacks, introducing a new metric called UAR (Unforeseen Attack Robustness) that assesses model performance against unanticipated distortion types beyond the commonly studied Lp norms.

#evaluation-metrics

On the quantitative analysis of decoder-based generative models

OpenAI Blog · 2016-11-14

This paper proposes using Annealed Importance Sampling to evaluate log-likelihoods for decoder-based generative models (VAEs, GANs, etc.), addressing the challenge of intractable likelihood estimation. The authors validate their method and provide evaluation code to analyze model performance, overfitting, and mode coverage.
