evaluation-benchmark

#evaluation-benchmark

Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management

arXiv cs.CL · 10h ago

This paper presents an overview of the second edition of the TalentCLEF challenge at CLEF 2026, which includes tasks on job-person matching and job-skill matching in English and Spanish, attracting over 400 submissions.


Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

arXiv cs.CL · 2026-06-10

This paper introduces Engram, an open-source bi-temporal memory engine for LLM agents that retrieves a compact context slice (∼9.6k tokens) to outperform the full-history baseline (79k tokens) by 10.4 accuracy points on LongMemEval, using a hybrid read path fusing dense, lexical, graph, and temporal signals.


Building Better Activation Oracles

arXiv cs.LG · 2026-06-03

This paper presents improvements to Activation Oracles (AOs) for interpreting residual stream activations, including a new conversational dataset, multi-layer injections, and on-policy training. The authors also release AObench, the first comprehensive evaluation suite for AO quality.


CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Hugging Face Daily Papers · 2026-05-28

CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.


@_akhaliq: LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

X AI KOLs Following · 2026-05-21

LongMINT is a benchmark for evaluating memory under multi-target interference in long-horizon agent systems.


EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Hugging Face Daily Papers · 2026-05-13

EVA-Bench introduces a comprehensive end-to-end framework for evaluating voice agents, simulating realistic multi-turn conversations and measuring performance across voice-specific failure modes with novel accuracy (EVA-A) and experience (EVA-X) metrics. The benchmark includes 213 scenarios across enterprise domains and a perturbation suite for accent and noise robustness, revealing substantial gaps in current systems.


TeamBench: Evaluating Agent Coordination under Enforced Role Separation

arXiv cs.AI · 2026-05-11

This paper introduces TeamBench, a benchmark for evaluating agent coordination under enforced role separation. It addresses the problem that roles defined only in prompts may not hold agents to their intended constraints.


DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Hugging Face Daily Papers · 2026-05-06

This paper introduces the DecodingTrust-Agent Platform (DTap), a controllable and interactive red-teaming platform for evaluating AI agent security across multiple domains. It also presents DTap-Red, an autonomous agent for discovering attack strategies, and DTap-Bench, a large-scale dataset for risk assessment.


SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Hugging Face Daily Papers · 2026-05-06

This paper introduces SWE-WebDevBench, a comprehensive 68-metric framework for evaluating AI-powered application development platforms as virtual software agencies. The study highlights critical gaps in current platforms regarding specification understanding, backend reliability, production readiness, and security.


OpenGame: Open Agentic Coding for Games

Papers with Code Trending · 2026-04-20

OpenGame is an open-source agentic framework for end-to-end web game creation, powered by the specialized GameCoder-27B model and evaluated via the new OpenGame-Bench benchmark.


Measuring the performance of our models on real-world tasks

OpenAI Blog · 2025-09-25

OpenAI introduces GDPval, a new evaluation framework measuring AI model performance on economically valuable, real-world tasks across 44 occupations in the top 9 US GDP-contributing industries. The benchmark includes 1,320 specialized tasks based on actual professional work products, representing a progression from academic benchmarks to more realistic occupational assessments.
