Tag
This paper presents an overview of the second edition of the TalentCLEF challenge at CLEF 2026, which includes tasks on job-person matching and job-skill matching in English and Spanish, attracting over 400 submissions.
This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents that retrieves a compact context slice (∼9.6k tokens) to outperform the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path fusing dense, lexical, graph, and temporal signals.
This paper presents improvements to Activation Oracles (AOs) for interpreting residual stream activations, including a new conversational dataset, multi-layer injections, and on-policy training. The authors also release AObench, the first comprehensive evaluation suite for AO quality.
CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.
LongMINT is a benchmark for evaluating memory under multi-target interference in long-horizon agent systems.
EVA-Bench introduces a comprehensive end-to-end framework for evaluating voice agents, simulating realistic multi-turn conversations and measuring performance across voice-specific failure modes with novel accuracy (EVA-A) and experience (EVA-X) metrics. The benchmark includes 213 scenarios across enterprise domains and a perturbation suite for accent and noise robustness, revealing substantial gaps in current systems.
This article introduces TeamBench, a benchmark for evaluating agent coordination under enforced role separation, addressing issues where prompt-only roles may bypass intended constraints.
This paper introduces the DecodingTrust-Agent Platform (DTap), a controllable and interactive red-teaming platform for evaluating AI agent security across multiple domains. It also presents DTap-Red, an autonomous agent for discovering attack strategies, and DTap-Bench, a large-scale dataset for risk assessment.
This paper introduces SWE-WebDevBench, a comprehensive 68-metric framework for evaluating AI-powered application development platforms as virtual software agencies. The study highlights critical gaps in current platforms regarding specification understanding, backend reliability, production readiness, and security.
OpenGame is an open-source agentic framework for end-to-end web game creation, powered by the specialized GameCoder-27B model and evaluated via the new OpenGame-Bench benchmark.
OpenAI introduces GDPval, a new evaluation framework measuring AI model performance on economically valuable, real-world tasks across 44 occupations in the top 9 US GDP-contributing industries. The benchmark includes 1,320 specialized tasks based on actual professional work products, representing a progression from academic benchmarks to more realistic occupational assessments.