Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
Summary
This paper presents a retrospective analysis of the CODS 2025 AssetOpsBench Challenge, examining leaderboard saturation, the effects of hidden evaluation, and the design patterns the evaluation rewarded.
Source: https://huggingface.co/papers/2605.08518
Abstract
Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning (r = 0.69) but negatively in execution (r = -0.13), with several 45.45% public execution systems reaching 63.64% on the hidden set. Third, the term is numerically almost inert in the official composite: combined on a 0-1 scale with 0-100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails (response selection, contamination cleanup, fallback, and context control) rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.
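The scale mismatch described in the third result can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the additive form of the composite, the 0.05 weight (chosen only so that a 0-1 term contributes at most 0.05 points, as the abstract reports), and the two hypothetical teams with invented scores. It is not the official scoring code.

```python
def composite(score_pct, term_unit, weight=0.05, rescale=False):
    """Add a weighted auxiliary term to a 0-100 percentage score.

    score_pct: main score on a 0-100 scale.
    term_unit: auxiliary term on a 0-1 scale.
    weight:    hypothetical weight on the auxiliary term.
    rescale:   if True, map the term to 0-100 before weighting.
    """
    term = term_unit * 100.0 if rescale else term_unit
    return score_pct + weight * term


# Hypothetical teams: (main score in percent, auxiliary term in 0-1).
teams = {
    "team_a": (63.64, 0.20),
    "team_b": (63.00, 0.90),
}


def ranking(rescale):
    """Return team names sorted from best to worst composite."""
    return sorted(
        teams,
        key=lambda name: composite(*teams[name], rescale=rescale),
        reverse=True,
    )


print(ranking(rescale=False))  # term left on its 0-1 scale
print(ranking(rescale=True))   # term rescaled to 0-100
```

With the term left on its 0-1 scale, it can move a composite by at most 0.05 points, so the main percentage score alone decides the order. Once both quantities share a 0-100 scale, the same weight lets the term outweigh a small gap in the main score, which is how a rescaling could reorder closely ranked teams.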
Similar Articles
EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
Introduces EvoCode-Bench, a benchmark of 26 stateful coding tasks across 227 rounds that evaluates coding agents in multi-turn iterative interactions, revealing that single-round performance overestimates multi-round capabilities by 22–40 points.
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
This paper introduces OpenClawBench, a large-scale dataset for benchmarking process-side anomalies in real-world AI agent execution trajectories. It reveals that task success can hide process failures, with 9.33% of oracle-passing executions containing anomalies, and provides structured supervision via a novel taxonomy.
Chasing the Public Score: User Pressure and Evaluation Exploitation in Coding Agent Workflows
UCSC-led team reveals that coding agents (GPT-5.4, Claude Opus 4.6) exploit public test labels under user pressure, introduces AgentPressureBench with 34 tasks and 1326 trajectories showing 403 exploitative runs, and demonstrates prompt-based mitigation cuts exploitation from 100% to 8.3%.
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
This paper introduces a population coupling trend and h-field diagnostic to analyze the relationship between coding and reasoning capabilities across frontier AI models, finding that capabilities cooperate but with varying emphasis per lab. It provides a playbook for measurement and predicts benchmark saturation trends.