evaluation-metric

Evaluating and Preserving Lexical Stress in English-to-Chinese Speech-to-Speech Translation

arXiv cs.CL · yesterday

This paper proposes a new metric for evaluating lexical stress and a stress-aware system for preserving it in English-to-Chinese speech-to-speech translation, reporting significant improvements over existing approaches while maintaining translation quality.

GENIE: A Fine-Grained Measure for Novelty

arXiv cs.CL · 5d ago

GENIE is a fine-grained evaluation metric that measures the novelty of LLM responses along task-specific features, providing more insight than holistic metrics.

A Novel Evaluation Metric for Unsupervised Learning in AIS-Based Maritime Anomaly Detection: MADQI

arXiv cs.LG · 2026-06-01

This paper introduces MADQI, a composite metric for evaluating unsupervised anomaly detection in maritime AIS data without requiring labeled data. The framework combines four metrics and achieves an 80.37% MADQI score on test datasets.

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Hugging Face Daily Papers · 2026-05-31

This paper identifies memory confabulation in Reflexion-style agents, where agents store incorrect task interpretations and persist in errors across environment resets. The authors introduce the Reflection Repetition Rate (RRR) metric to detect this and propose a mitigation that replaces open-ended self-diagnosis with programmatic failure signal extraction.

Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering

arXiv cs.CL · 2026-05-27

Granuscore is a reference-free measure of granularity for text analysis and question answering. It uses hierarchical embedding spaces to capture fine-grained vs. coarse language and demonstrates consistent differences in model behavior across QA benchmarks.

Measuring the Depth of LLM Unlearning via Activation Patching

arXiv cs.CL · 2026-05-26

The paper proposes the Unlearning Depth Score (UDS), a metric that uses activation patching to quantify how thoroughly target knowledge is erased from LLMs, achieving state-of-the-art faithfulness and robustness across multiple unlearning methods.

QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization

arXiv cs.LG · 2026-05-13

This paper introduces QuIDE, a framework featuring an Intelligence Index to evaluate the trade-offs between compression, accuracy, and latency in quantized neural networks. It demonstrates that optimal bit-widths vary by task, with 4-bit being ideal for LLMs and simple tasks, while 8-bit is better for complex CNNs.

ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Hugging Face Daily Papers · 2026-04-14

ROSE is a novel intent-centered evaluation metric for NL2SQL that uses a Prover-Refuter cascade to assess semantic correctness independently of ground-truth SQL, achieving 24% better agreement with human experts than existing metrics. The paper addresses limitations of Execution Accuracy and provides a re-evaluation of 19 NL2SQL methods with publicly released resources.
