agent-benchmarks

#agent-benchmarks

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

arXiv cs.AI · 2026-05-27

Anchor is a task-generation pipeline that addresses artifact drift in AI agent benchmarks by jointly producing instructions, environments, solutions, and verifiers from a single constraint optimization specification, yielding consistent and auditable evaluation tasks for enterprise workflows. The paper introduces ERP-Bench, a benchmark of 300 long-horizon tasks in a production ERP system, showing that frontier models satisfy explicit constraints in 26.1% of trials but reach optimal solutions in only 17.4%.

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Hugging Face Daily Papers · 2026-05-27

TASTE is an automated method for generating challenging agent benchmarks with broader tool-use coverage. It evolves tool sequences through adaptive contrastive n-gram modeling and iterative difficulty refinement. On the resulting τ^c-Bench, models that nearly saturate existing benchmarks suffer severe performance drops, suggesting that their high scores reflect benchmark saturation rather than robust skill.

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

arXiv cs.CL · 2026-05-22

SynAE is a framework for evaluating the quality of synthetic data used in tool-calling agent evaluations, assessing validity, fidelity, and diversity across multiple axes. It addresses challenges of insufficient or sensitive real data by providing metrics to guide synthetic data generation.

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

Hugging Face Daily Papers · 2026-05-20

This paper introduces temporal semantic caching and MCP workflow optimizations for agentic plan-execute pipelines, achieving up to 30.6x speedup on cache hits and 1.67x overall speedup on the AssetOpsBench industrial benchmark.

Interactive Evaluation Requires a Design Science

Hugging Face Daily Papers · 2026-05-18

This position paper argues that interactive AI evaluation should be treated as a design science, and proposes a two-axis taxonomy and reporting standards for assessing dynamic system behavior through trajectories.
