Tag
This paper proposes a new metric and a stress-aware system for evaluating and preserving lexical stress in English-to-Chinese speech-to-speech translation, demonstrating significant improvements over existing approaches while maintaining translation quality.
GENIE is a fine-grained evaluation metric that measures the novelty of LLM responses along task-specific features, providing more insight than holistic metrics.
This paper introduces MADQI, a composite metric for evaluating unsupervised anomaly detection in maritime AIS data without requiring labeled data. The framework combines four metrics and achieves an 80.37% MADQI score on test datasets.
This paper identifies memory confabulation in Reflexion-style agents, where agents store incorrect task interpretations and persist in the same errors across environment resets. The authors introduce the Reflection Repetition Rate (RRR) metric to detect this behavior and propose a mitigation that replaces open-ended self-diagnosis with programmatic extraction of failure signals.
Granuscore is a reference-free measure of granularity for text analysis and question answering. It uses hierarchical embedding spaces to capture fine-grained vs. coarse language and demonstrates consistent differences in model behavior across QA benchmarks.
The paper proposes the Unlearning Depth Score (UDS), a metric that uses activation patching to quantify how thoroughly target knowledge is erased from LLMs, achieving state-of-the-art faithfulness and robustness across multiple unlearning methods.
This paper introduces QuIDE, a framework featuring an Intelligence Index to evaluate the trade-offs between compression, accuracy, and latency in quantized neural networks. It demonstrates that the optimal bit-width varies by task: 4-bit is ideal for LLMs and simple tasks, while 8-bit is better for complex CNNs.
ROSE is a novel intent-centered evaluation metric for NL2SQL that uses a Prover-Refuter cascade to assess semantic correctness independently of ground-truth SQL, achieving 24% better agreement with human experts than existing metrics. The paper addresses limitations of Execution Accuracy and provides a re-evaluation of 19 NL2SQL methods with publicly released resources.