EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Hugging Face Daily Papers

Summary

EVA-Bench introduces a comprehensive end-to-end framework for evaluating voice agents, simulating realistic multi-turn conversations and measuring performance across voice-specific failure modes with novel accuracy (EVA-A) and experience (EVA-X) metrics. The benchmark includes 213 scenarios across enterprise domains and a perturbation suite for accent and noise robustness, revealing substantial gaps in current systems.

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
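The distinction between peak capability (pass@k) and reliable capability (pass^k) can be made concrete with the standard unbiased estimators. The sketch below is illustrative, not the paper's implementation. It assumes each scenario is run n times, that each trial is reduced to a binary pass or fail, and that c of the n trials pass. The function names `pass_at_k` and `pass_hat_k` are hypothetical, and how EVA-Bench thresholds EVA-A or EVA-X into a pass is not stated in this summary.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k trials passes,
    given c passes observed in n trials (peak capability)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k trials pass,
    given c passes observed in n trials (reliable capability)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > c, so the estimate is 0.0 in that case.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical scenario: 5 trials, 3 of which passed.
    n, c, k = 5, 3, 3
    peak = pass_at_k(n, c, k)       # 1.0: any 3 trials include a pass
    reliable = pass_hat_k(n, c, k)  # 0.1: only 1 of 10 triples is all-pass
    print(f"pass@{k}={peak:.2f} pass^{k}={reliable:.2f} gap={peak - reliable:.2f}")
```

For k = 1 the two estimators coincide at c / n, which is why the gap between them only appears for k greater than 1.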

Cached at: 05/14/26, 08:20 PM


Source: https://huggingface.co/papers/2605.13841

Abstract

EVA-Bench presents a comprehensive evaluation framework for voice agents that simulates realistic conversations and measures performance across multiple voice-specific failure modes using novel accuracy and experience metrics.

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
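The validate-then-regenerate step described above (detect user simulator error, regenerate the conversation, and only then score) can be sketched as a bounded retry loop. This is a minimal illustration under stated assumptions, not the EVA-Bench code: the `Conversation` type, the `simulate_until_valid` function, the `run_conversation` and `is_valid` callables, and the `max_attempts` limit are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Conversation:
    """Hypothetical record of one simulated bot-to-bot dialogue."""
    turns: List[str]
    simulator_error: bool  # True if the user simulator went off-script


def simulate_until_valid(
    run_conversation: Callable[[int], Conversation],
    is_valid: Callable[[Conversation], bool],
    max_attempts: int = 3,
) -> Optional[Conversation]:
    """Regenerate a conversation until it passes validation.

    Returns the first valid conversation, or None if every attempt
    fails, so an invalid simulation is never passed on to scoring.
    """
    for attempt in range(max_attempts):
        conversation = run_conversation(attempt)
        if is_valid(conversation):
            return conversation
    return None
```

Returning `None` rather than the last invalid attempt keeps simulator faults from being scored as agent failures, which is the purpose the abstract gives for validating before scoring.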


Get this paper in your agent:

hf papers read 2605.13841

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

ServiceNow-AI/eva


Similar Articles

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

Hugging Face Blog

ServiceNow AI releases EVA-Bench Data 2.0, an expanded open-source benchmark for evaluating voice agents across 3 enterprise domains (Airline CSM, IT Service Management, Healthcare HRSD) with 213 scenarios and 121 tools, validated against GPT-4.5, Gemini, and Claude.

An Empirical Study of Automating Agent Evaluation

arXiv cs.CL

This paper introduces EvalAgent, a system that automates the evaluation of AI agents by encoding domain-specific expertise, addressing the limitations of standard coding assistants in this task. It also presents AgentEvalBench, a benchmark for testing evaluation pipelines, and demonstrates significant improvements in evaluation reliability.