Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Hugging Face Daily Papers

Summary

This paper audits three performance-optimization benchmarks for coding agents (GSO, SWE-Perf, and SWE-fficiency). It finds that runtime instability, scoring rules, and task coverage significantly affect reliability, and that many tasks are already solved by at least one public submission.

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency's leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.

Cached at: 07/02/26, 03:49 PM


Source: https://huggingface.co/papers/2607.01211

Abstract

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency’s leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.
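
The "9 of 28 pairwise submission comparisons" figure can be read as a count over unordered pairs: eight shared submissions give C(8, 2) = 28 pairs, and a pair counts as a disagreement when the two leaderboards order it in opposite directions. The Python sketch below illustrates that counting under stated assumptions. It is not the paper's code: the function name, the submission labels, and all scores are hypothetical, and ties are treated as non-disagreements, which the abstract does not specify.

```python
from itertools import combinations


def count_pairwise_disagreements(scores_a, scores_b):
    """Return (disagreements, total_pairs) over submissions present in both tables.

    scores_a and scores_b map a submission name to its leaderboard score,
    where a higher score means a better rank. Hypothetical helper.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    disagreements = 0
    total_pairs = 0
    for x, y in combinations(shared, 2):
        total_pairs += 1
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        # Opposite signs: the two leaderboards order this pair differently.
        # A tie on either leaderboard (diff == 0) is not counted (assumption).
        if diff_a * diff_b < 0:
            disagreements += 1
    return disagreements, total_pairs


# Made-up scores for illustration only; these are not values from the paper.
bench_a = {"s1": 0.9, "s2": 0.8, "s3": 0.7, "s4": 0.6}
bench_b = {"s1": 0.5, "s2": 0.9, "s3": 0.4, "s4": 0.3}
print(count_pairwise_disagreements(bench_a, bench_b))
```

With eight shared submissions the second element of the returned tuple is 28, matching the denominator quoted in the abstract.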

