Tag
This paper proposes offline preference-based trajectory evaluation for agentic systems, which compares trajectories via temporal preferences rather than binary success metrics. It shows that this approach reduces ties from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.
This paper demonstrates that pointwise metrics like RMSE and MAE are structurally misleading for inverse problems with multimodal posteriors, because optimal point estimators collapse the posterior and distort spectral features. It proposes a three-part evaluation protocol using per-event distributional accuracy, spectrum-fidelity diagnostics, and coverage-based calibration to address these failures.
This paper proposes evaluating speech articulation synthesis using phoneme recognition with articulatory features, addressing limitations of traditional metrics like point-wise distance. Experiments on a single-speaker RT-MRI dataset show the approach captures phonetic nuances and improves assessment.
This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) to evaluate trustworthiness over traditional multimodal gain.
This piece argues that traditional metrics such as accuracy and click-through rate are inadequate for evaluating AI agent recommendations, and proposes a more holistic, long-term evaluation that covers user understanding, trade-offs, and real-world problem-solving.
This paper proposes two new metrics, Knowledge Separability Score (KSS) and Knowledge Persistence Score (KPS), to evaluate cross-linguistic information removal in multilingual machine unlearning for LLMs, addressing shortcomings of prior per-language evaluation protocols.
Artificial Analysis introduces the Coding Agent Index, a new benchmark suite combining SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA to evaluate the performance of AI coding agents across diverse tasks.
This paper introduces MIND (Monge Inception Distance), a new metric for evaluating generative models that is more sample-efficient, faster, and robust than the standard Fréchet Inception Distance (FID).
This paper critiques the use of single-reference ground truth in ASR evaluation, arguing it causes epistemic injustice for speakers with aphasia. It proposes a new metric, Epistemic Injustice Distance, and advocates for WER-Range to account for diverse transcription conventions.
This paper introduces DeltaRubric, a two-step multimodal preference evaluation approach that uses a single MLLM to improve reward modeling reliability through joint planning and verification.
This paper introduces FD-loss, a method to optimize Fréchet Distance as a training objective for visual generation by decoupling population and batch sizes. It demonstrates that this approach improves generator quality and suggests FID may not always accurately reflect visual quality.
OpenAI shares lessons learned on language model safety and misuse, discussing challenges in measuring risks, the limitations of existing benchmarks, and their development of new evaluation metrics for toxicity and policy violations. The post also highlights concerns about labor market impacts and the need for continued research on measuring social effects of AI deployment at scale.
OpenAI researchers developed a method to evaluate neural network robustness against unforeseen adversarial attacks, introducing a new metric called UAR (Unforeseen Attack Robustness) that assesses model performance against unanticipated distortion types beyond the commonly studied Lp norms.
This paper proposes using Annealed Importance Sampling to evaluate log-likelihoods for decoder-based generative models (VAEs, GANs, etc.), addressing the challenge of intractable likelihood estimation. The authors validate their method and provide evaluation code to analyze model performance, overfitting, and mode coverage.
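As a rough illustration of the Annealed Importance Sampling idea in the last entry, the sketch below estimates the log normalizing constant of a one-dimensional unnormalized Gaussian. It is a toy under stated assumptions, not the paper's evaluation code: the target density, the linear temperature schedule, the random-walk Metropolis transition, and all function names (`log_prior`, `log_target_unnorm`, `ais_log_z`) are hypothetical choices made for brevity.

```python
import math
import random


def log_prior(x):
    # Normalized standard normal N(0, 1); AIS starts from exact samples of it.
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)


def log_target_unnorm(x, sigma=0.5):
    # Unnormalized Gaussian (assumed toy target); true Z = sigma * sqrt(2*pi).
    return -0.5 * (x / sigma) ** 2


def log_annealed(x, beta):
    # Geometric path between prior (beta = 0) and target (beta = 1).
    return (1.0 - beta) * log_prior(x) + beta * log_target_unnorm(x)


def ais_log_z(n_chains=500, n_temps=100, step=0.5, seed=0):
    """Estimate log Z of the toy target with Annealed Importance Sampling."""
    rng = random.Random(seed)
    betas = [t / n_temps for t in range(n_temps + 1)]  # linear schedule (assumption)
    log_weights = []
    for _ in range(n_chains):
        x = rng.gauss(0.0, 1.0)  # exact draw from the prior
        log_w = 0.0
        for b_prev, b in zip(betas[:-1], betas[1:]):
            # Importance-weight update for moving from beta = b_prev to beta = b.
            log_w += (b - b_prev) * (log_target_unnorm(x) - log_prior(x))
            # One random-walk Metropolis step leaving the beta = b density invariant.
            prop = x + rng.gauss(0.0, step)
            delta = log_annealed(prop, b) - log_annealed(x, b)
            if rng.random() < math.exp(min(0.0, delta)):
                x = prop
        log_weights.append(log_w)
    # Log of the mean weight, computed stably with the log-sum-exp trick.
    m = max(log_weights)
    return m + math.log(sum(math.exp(w - m) for w in log_weights) / n_chains)


if __name__ == "__main__":
    estimate = ais_log_z()
    exact = math.log(0.5 * math.sqrt(2 * math.pi))
    print(f"AIS estimate of log Z: {estimate:.3f} (exact: {exact:.3f})")
```

Because the toy target has a known normalizer, the estimate can be compared against the exact value. For a decoder-based generative model no such closed form exists, which is the situation the summarized paper addresses.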