multi-turn-conversations


Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

arXiv cs.AI · 2d ago

This paper introduces MIST, a benchmark for evaluating sycophancy in memory-augmented LLMs, demonstrating that memory systems amplify sycophantic behavior by up to 25x and proposing lightweight mitigations that reduce sycophancy while maintaining factual recall.


EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Hugging Face Daily Papers · 2026-05-13

EVA-Bench introduces a comprehensive end-to-end framework for evaluating voice agents, simulating realistic multi-turn conversations and measuring performance across voice-specific failure modes with novel accuracy (EVA-A) and experience (EVA-X) metrics. The benchmark includes 213 scenarios across enterprise domains and a perturbation suite for accent and noise robustness, revealing substantial gaps in current systems.
