PACE: A Proxy for Agentic Capability Evaluation
Summary
This paper introduces PACE, a framework that predicts expensive LLM agent benchmark scores using a small subset of cheaper non-agentic evaluation instances, achieving high accuracy at less than 1% of the cost.
Cached at: 07/03/26, 03:52 AM
Source: https://huggingface.co/papers/2607.02032
Abstract
PACE is a framework that predicts expensive agentic LLM benchmark performance using a small subset of atomic evaluation instances, achieving high accuracy at a fraction of the cost.
Evaluating LLM agents on benchmarks like SWE-Bench and GAIA can be expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted from performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performance on agentic benchmarks. Given a pool of candidate instances spanning atomic capabilities, PACE fits a regression that maps a model’s scores on a compact subset of source instances to its score on the target agentic benchmark. The subset itself is curated by combining two complementary instance-selection strategies, target-relevance local selection and globally informative global selection. We apply PACE to 4 target agentic benchmarks, which yields PACE-Bench, the concrete proxy benchmark evaluated in this paper. Experiments across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show that PACE-Bench predicts agentic scores with leave-one-out cross-validation (LOOCV) mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at much less than 1% of the full agentic evaluation cost. We further analyze the selected proxy instances, revealing which skills each agentic benchmark uniquely demands. PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing, without the overhead of full agent evaluation.
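The evaluation protocol named in the abstract (a regression from proxy scores to an agentic score, judged by LOOCV MAE and pairwise model-ranking accuracy) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the proxy score is simply the mean over the selected instances and that the regression is a one-feature least-squares line, neither of which the abstract specifies. All function names and all numbers below are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # all inputs identical: fall back to predicting the mean
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def loocv_predictions(proxy, target):
    """Predict each model's target score from a line fit on all other models."""
    preds = []
    for i in range(len(proxy)):
        xs = proxy[:i] + proxy[i + 1:]
        ys = target[:i] + target[i + 1:]
        a, b = fit_line(xs, ys)
        preds.append(a * proxy[i] + b)
    return preds


def mean_absolute_error(preds, target):
    return mean(abs(p - t) for p, t in zip(preds, target))


def pairwise_ranking_accuracy(preds, target):
    """Fraction of model pairs that preds orders the same way as target."""
    pairs = [(i, j) for i, j in combinations(range(len(target)), 2)
             if target[i] != target[j]]  # ties in the target are skipped
    if not pairs:
        return 0.0
    correct = sum((preds[i] - preds[j]) * (target[i] - target[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)


# Toy data, invented for illustration: one row per model, one 0/1 result
# per selected proxy instance.
instance_scores = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
# Hypothetical agentic benchmark scores for the same six models.
agentic_scores = [0.62, 0.55, 0.41, 0.30, 0.22, 0.08]

proxy_scores = [mean(row) for row in instance_scores]
predictions = loocv_predictions(proxy_scores, agentic_scores)
print("LOOCV MAE:", mean_absolute_error(predictions, agentic_scores))
print("Pairwise ranking accuracy:",
      pairwise_ranking_accuracy(predictions, agentic_scores))
```

Holding out one model at a time matters when only a handful of models are available (14 in the paper): each prediction comes from a fit that never saw the model being scored, so the reported error reflects how the proxy would behave on a new model rather than how well it fits the models already measured.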
Similar Articles
EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
This paper introduces EPC, a standardized protocol for measuring evaluator preference coupling in LLM agent systems, including a reference snapshot and versioning convention to address reproducibility and measurement decay.
BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents
BiPACE introduces a drop-in advantage estimator that fixes state-action credit mismatch in stepwise group-based RL for LLM agents, using bisimulation-guided state clustering and action counterfactual estimation, achieving significant performance gains on ALFWorld, WebShop, and TextCraft with Qwen2.5 models.
AgenticDataBench: A Comprehensive Benchmark for Data Agents
Introduces AgenticDataBench, a comprehensive benchmark for evaluating LLM-based data agents across diverse domains with fine-grained skill-based metrics, including real-world B2B use cases and synthetic tasks.
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
MCP-Persona is a benchmark evaluating LLM agents on personalized tools interacting with individual accounts and local databases. Experiments reveal significant challenges for state-of-the-art agents in personalized tool use.
Agent Evaluation: A Detailed Guide (53 minute read)
A comprehensive guide on evaluating LLM-based agent systems, covering fundamental concepts, evaluation frameworks, and case studies from recent benchmarks.