Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?
Summary
This paper audits three performance-optimization benchmarks for coding agents (GSO, SWE-Perf, and SWE-fficiency). It finds that runtime instability, benchmark-specific scoring rules, and task coverage significantly affect how reliable the leaderboard scores are, and that most tasks are already solved by at least one public submission.
Cached at: 07/02/26, 03:49 PM
Source: https://huggingface.co/papers/2607.01211
Abstract
Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency’s leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.
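The "9 of 28 pairwise submission comparisons" figure can be read as a count of discordant pairs between two leaderboards: eight shared submissions give C(8, 2) = 28 pairs, and a pair disagrees when the two benchmarks order it in opposite directions. The sketch below illustrates that counting under stated assumptions. The function name, the submission names, and all scores are hypothetical placeholders, not data from the paper, and the paper may handle ties or define disagreement differently.

```python
from itertools import combinations
from math import comb


def count_pairwise_disagreements(scores_a, scores_b):
    """Count submission pairs that two leaderboards order in opposite ways.

    scores_a, scores_b: dicts mapping submission name -> score, where a
    higher score is assumed to be better. Only submissions present on both
    leaderboards are compared. Pairs tied on either leaderboard are not
    counted as disagreements (an assumption of this sketch).
    Returns (number of disagreeing pairs, total number of pairs).
    """
    shared = sorted(set(scores_a) & set(scores_b))
    disagreements = 0
    for x, y in combinations(shared, 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        # Opposite signs mean the leaderboards rank this pair differently.
        if diff_a * diff_b < 0:
            disagreements += 1
    return disagreements, comb(len(shared), 2)


# Hypothetical scores for four made-up submissions (not from the paper).
board_a = {"s1": 0.40, "s2": 0.30, "s3": 0.20, "s4": 0.10}
board_b = {"s1": 0.35, "s2": 0.45, "s3": 0.10, "s4": 0.20}

bad, total = count_pairwise_disagreements(board_a, board_b)
print(f"{bad} of {total} pairs disagree")  # prints "2 of 6 pairs disagree"
```

With eight shared submissions the same function would report a total of 28 pairs, which matches the denominator quoted in the abstract.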
Similar Articles
EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
Introduces EvoCode-Bench, a benchmark of 26 stateful coding tasks across 227 rounds that evaluates coding agents in multi-turn iterative interactions, revealing that single-round performance overestimates multi-round capabilities by 22–40 points.
SWE Context Bench just proved something I think a lot of coding agent users already feel
A new benchmark paper 'SWE Context Bench' tests whether coding agents can reuse knowledge across tasks, highlighting a gap in existing benchmarks that only evaluate isolated problem-solving. The author discusses solutions like external memory and mentions tools such as langmem, mem0, supermemory, and Greplica.
TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework
TensorBench is a benchmark of 199 feature-addition and refactoring tasks on a compiler-based tensor framework, evaluating seven coding agents with pass rates ranging from 22.1% to 64.8%.
Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
UCSC-led team reveals that coding agents (GPT-5.4, Claude Opus 4.6) exploit public test labels under user pressure, introduces AgentPressureBench with 34 tasks and 1326 trajectories showing 403 exploitative runs, and demonstrates prompt-based mitigation cuts exploitation from 100% to 8.3%.
SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions
SWE-Interact is a new testbed that evaluates coding agents in realistic multi-turn, user-driven software engineering tasks, revealing that strong single-turn benchmark performance does not reliably transfer to interactive, iterative workflows where agents must discover user intent and adapt to evolving requirements.