PACE: A Proxy for Agentic Capability Evaluation

Hugging Face Daily Papers

Summary

This paper introduces PACE, a framework that predicts expensive LLM agent benchmark scores using a small subset of cheaper non-agentic evaluation instances, achieving high accuracy at less than 1% of the cost.

Evaluating LLM agents on benchmarks like SWE-Bench and GAIA is expensive and time-consuming, and it requires complex infrastructure. A single evaluation can cost thousands of dollars and take days to complete. In contrast, non-agentic LLM benchmarks that test individual capabilities (e.g., reasoning, code generation) are fast and cheap to run. In this paper, we investigate whether performance on expensive agentic benchmarks can be accurately predicted from performance on a small, carefully selected subset of atomic evaluation instances. We introduce PACE, a framework that constructs proxy benchmarks by selecting instances from existing non-agentic evaluations whose aggregate scores most reliably predict model performance on agentic benchmarks. Given a pool of candidate instances spanning atomic capabilities, PACE fits a regression that maps a model's scores on a compact subset of source instances to its score on the target agentic benchmark. The subset itself is curated by combining two complementary instance-selection strategies: target-relevance local selection and globally informative global selection. Applying PACE to the 4 target agentic benchmarks yields PACE-Bench, the concrete proxy benchmark that we evaluate. Experiments across 14 models, 4 agentic benchmarks, and 19 non-agentic benchmarks show that PACE-Bench predicts agentic scores with leave-one-out cross-validation (LOOCV) mean absolute error (MAE) under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at much less than 1% of the full agentic evaluation cost. We further analyze the selected proxy instances, revealing which skills each agentic benchmark uniquely demands. PACE enables practitioners to obtain reliable estimates of agentic performance during model development, selection, and routing, without the overhead of full agent evaluation.
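The evaluation protocol described above (fit a regression from proxy-instance scores to an agentic score, then report LOOCV MAE, Spearman correlation, and pairwise ranking accuracy) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the abstract does not say which regression PACE uses, so ridge regression stands in; the data are synthetic; and the local/global instance selection step is not reproduced, so the proxy subset is taken as given. The function names are hypothetical, not from the paper.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (assumed model)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(Xc.shape[1]), Xc.T @ yc)
    return (X_test - x_mean) @ w + y_mean

def loocv_predictions(X, y, alpha=1.0):
    """Predict each model's agentic score from a fit on all other models."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # hold out model i
        preds[i] = ridge_fit_predict(X[mask], y[mask], X[i:i + 1], alpha)[0]
    return preds

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rank = lambda v: np.argsort(np.argsort(v))
    return float(np.corrcoef(rank(a), rank(b))[0, 1])

def pairwise_ranking_accuracy(y_true, y_pred):
    """Fraction of model pairs whose order is the same in truth and prediction."""
    correct = total = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # skip tied pairs
            total += 1
            correct += int((y_true[i] > y_true[j]) == (y_pred[i] > y_pred[j]))
    return correct / total

# Synthetic stand-in data: 14 models scored on 30 already-selected proxy instances.
rng = np.random.default_rng(0)
ability = rng.uniform(0.2, 0.9, size=14)                  # latent model quality
X = np.clip(ability[:, None] + rng.normal(0, 0.15, (14, 30)), 0, 1)
y = 100 * np.clip(ability - 0.1 + rng.normal(0, 0.03, 14), 0, 1)  # agentic score, %

preds = loocv_predictions(X, y)
mae = float(np.mean(np.abs(preds - y)))
print(f"LOOCV MAE: {mae:.2f} points")
print(f"Spearman:  {spearman(y, preds):.2f}")
print(f"Pairwise ranking accuracy: {pairwise_ranking_accuracy(y, preds):.2f}")
```

With only 14 models, each LOOCV fold trains on 13 points, so some regularization (here the `alpha` penalty) is a natural choice when the number of proxy instances exceeds the number of models; the numbers printed by this sketch reflect the synthetic data only and say nothing about the paper's results.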
Cached at: 07/03/26, 03:52 AM

Source: https://huggingface.co/papers/2607.02032
