SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

arXiv cs.AI 05/11/26, 04:00 AM Papers
Summary
SREGym is a live, high-fidelity benchmark for AI SRE agents that simulates complex production failure scenarios using real-world cloud-native stacks.
arXiv:2605.07161v1 Announce Type: new Abstract: AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.
# SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
Source: [https://arxiv.org/html/2605.07161](https://arxiv.org/html/2605.07161)
Yiming Su⋄∗, Saad Mohammad Rafid Pial⋄, Yifang Tian†, Lily Gniedziejko⋄, Hans\-Arno Jacobsen†, Yinfang Chen⋄, Tianyin Xu⋄

⋄University of Illinois Urbana\-Champaign, †University of Toronto

###### Abstract

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering \(SRE\)\. Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs\. We present SREGym, a high\-fidelity benchmark for SRE agents\. SREGym exposes a live system environment built atop real\-world cloud\-native system stacks, where high\-fidelity failure scenarios are simulated through fault injectors\. SREGym models the complexity of production environments by simulating \(1\) a wide range of faults at different layers, \(2\) various ambient noises, and \(3\) diverse failure modes such as metastable failures and correlated failures\. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks\. SREGym currently includes 90 realistic, challenging SRE problems\. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end\-to\-end results\. SREGym is actively maintained as an open\-source project and has been used by researchers and practitioners\.

## 1 Introduction

The software lifecycle extends far beyond writing code, encompassing the continuous operation of deployed systems in production\. While coding agents have transformed software development\[[62](https://arxiv.org/html/2605.07161#bib.bib112),[1](https://arxiv.org/html/2605.07161#bib.bib113)\], accelerated \(vibe\) coding introduces reliability challenges: AI\-generated code is reported to introduce 1\.7× more defects than human\-written code\[[49](https://arxiv.org/html/2605.07161#bib.bib114)\], and 43% of code changes made by AI escape testing and cause production issues\[[2](https://arxiv.org/html/2605.07161#bib.bib2)\]\. Today, major services are already experiencing production outages caused by AI\-generated code\[[4](https://arxiv.org/html/2605.07161#bib.bib109)\]\. The capabilities of agentic AI must extend beyond writing code to mitigating production incidents, also known as Site Reliability Engineering \(SRE\)\.

SRE requires capabilities different from coding and Software Engineering \(SWE\) in general\. SRE agents must reason across multi\-modal observability data \(e\.g\., system configuration\[[80](https://arxiv.org/html/2605.07161#bib.bib20),[69](https://arxiv.org/html/2605.07161#bib.bib21)\], time\-series metrics\[[60](https://arxiv.org/html/2605.07161#bib.bib79),[42](https://arxiv.org/html/2605.07161#bib.bib15)\], unstructured logs\[[83](https://arxiv.org/html/2605.07161#bib.bib115),[89](https://arxiv.org/html/2605.07161#bib.bib23)\], and distributed traces\[[43](https://arxiv.org/html/2605.07161#bib.bib13),[9](https://arxiv.org/html/2605.07161#bib.bib12),[66](https://arxiv.org/html/2605.07161#bib.bib34)\]\), interact with domain\-specific tools, and execute multi\-step mitigation plans whose outcomes are only observable at runtime\. These requirements make SRE a uniquely challenging domain of agentic AI\.

![Refer to caption](https://arxiv.org/html/2605.07161v1/x1.png)

\(a\) A failure scenario where a microservice misses an environment variable it needs to send requests to\.
![Refer to caption](https://arxiv.org/html/2605.07161v1/x2.png)

\(b\) Agent trajectories of Stratus \(with Sonnet\-4\.6\) and Codex \(with GPT\-5\.4\) in addressing the problem\.

Figure 1: An SRE problem in SREGym and the trajectories of two agents when addressing it\.

Existing benchmarks are inadequate and fall behind the development of agentic SRE\. Static Q&A datasets\[[47](https://arxiv.org/html/2605.07161#bib.bib76),[67](https://arxiv.org/html/2605.07161#bib.bib110)\] are limited to domain knowledge\. Benchmarks for anomaly detection and root cause analysis \(RCA\)\[[33](https://arxiv.org/html/2605.07161#bib.bib77),[39](https://arxiv.org/html/2605.07161#bib.bib78),[79](https://arxiv.org/html/2605.07161#bib.bib83)\] only evaluate whether AI can detect and analyze failures in static datasets, but not whether it can resolve them\. Recent efforts such as AIOpsLab\[[14](https://arxiv.org/html/2605.07161#bib.bib27)\] and ITBench\[[41](https://arxiv.org/html/2605.07161#bib.bib28)\] take a step forward by creating a live system environment with simulated failures\. However, they are limited to oversimplistic SRE tasks\. For example, they primarily focus on application\-layer issues, whereas real\-world failures have diverse root causes across the stack; faults in lower layers such as operating systems\[[16](https://arxiv.org/html/2605.07161#bib.bib107)\] and hardware\[[35](https://arxiv.org/html/2605.07161#bib.bib111),[57](https://arxiv.org/html/2605.07161#bib.bib108)\] are known to be harder to address\. Moreover, they mostly inject a single failure into an otherwise clean environment, lacking the noises and unrelated events that coexist in production environments\[[30](https://arxiv.org/html/2605.07161#bib.bib62),[46](https://arxiv.org/html/2605.07161#bib.bib64),[37](https://arxiv.org/html/2605.07161#bib.bib10)\]\. Unfortunately, these benchmarks are hard to extend due to their bespoke designs, e\.g\., hardcoding failure simulation in problem\-specific scripts and lacking support for distributed event coordination\.

In this paper, we present SREGym, a high\-fidelity benchmark for SRE agents\. SREGym shares the high\-level principle of prior work\[[14](https://arxiv.org/html/2605.07161#bib.bib27),[41](https://arxiv.org/html/2605.07161#bib.bib28)\] to expose a live system environment built atop real\-world cloud\-native system stacks, where failures are simulated through fault injectors\. \(Footnote 1: We follow the terminology of the classic Fault\-Error\-Failure model\[[5](https://arxiv.org/html/2605.07161#bib.bib30),[44](https://arxiv.org/html/2605.07161#bib.bib29)\], where faults are root causes \(e\.g\., software bugs, hardware malfunctions, and misconfigurations\); a fault can cause abnormal behaviors referred to as errors, which \(if not handled properly\) further propagate and become visible to users, at which point they are referred to as failures\.\) In contrast to prior work, SREGym models the complexity of dynamic, noisy, and eventful production environments to achieve high\-fidelity failure scenarios that not only challenge AI but also ensure the relevance of the problems\. SREGym currently includes 90 realistic, challenging SRE problems\. Figure [1](https://arxiv.org/html/2605.07161#S1.F1) shows one problem and the corresponding agent behavior\. SREGym presents three new features:

1. Simulating a wide range of faults across the system stack, including hardware faults\[[35](https://arxiv.org/html/2605.07161#bib.bib111)\], OS kernel faults\[[16](https://arxiv.org/html/2605.07161#bib.bib107),[57](https://arxiv.org/html/2605.07161#bib.bib108)\], and misoperations\[[28](https://arxiv.org/html/2605.07161#bib.bib26),[70](https://arxiv.org/html/2605.07161#bib.bib32)\], in addition to application issues\.
2. Simulating various ambient noises, i\.e\., low\-impact faults that are unrelated to the root causes of target failures, which introduce ambiguity and potentially cause distractions\.
3. Supporting diverse failure modes such as metastable behavior\[[10](https://arxiv.org/html/2605.07161#bib.bib24),[36](https://arxiv.org/html/2605.07161#bib.bib6),[38](https://arxiv.org/html/2605.07161#bib.bib25)\] and concurrent, correlated failures\[[84](https://arxiv.org/html/2605.07161#bib.bib9),[26](https://arxiv.org/html/2605.07161#bib.bib11)\] by orchestrating distributed events \(e\.g\., faults and noises\)\.

SREGym is architected as a modular, extensible framework that composes fault and noise injectors across stacks into high\-fidelity failure scenarios as SRE problems\. Modularity and extensibility are not only sound engineering practice for usability and maintainability \(which are critical to SREGym as a community\-driven benchmark; see Appendix [B](https://arxiv.org/html/2605.07161#A2)\), but also key enablers of its mission to be a useful benchmark\. For instance, noises must be composed alongside target failures, and the manifestation of metastable behavior requires temporal coordination of multiple correlated faults and events\. SREGym provides a unified programming interface to curate high\-quality SRE problems by mutating existing failure scenarios \(e\.g\., by altering noises\) and creating new ones\.

We use SREGym to evaluate an SRE agent \(Stratus\[[13](https://arxiv.org/html/2605.07161#bib.bib54)\]\) and two coding agents \(Claude Code\[[20](https://arxiv.org/html/2605.07161#bib.bib84)\] and OpenAI Codex\[[21](https://arxiv.org/html/2605.07161#bib.bib88)\]\), with different models \(including Sonnet\-4\.6, GPT\-5\.4, and Kimi K2\.5\)\. Across agent\-model pairs, the diagnosis success rate ranges from 38\.9% to 72\.6% and the mitigation success rate ranges from 57\.3% to 78\.5%\. The agents are strong at addressing application issues, but their performance drops significantly on failures rooted in other layers or patterns, with up to 40% differences in end\-to\-end results\. For compound failures, agents tend to draw partial conclusions, missing opportunities to address them comprehensively\. Similarly, agents are affected by noises in nontrivial ways\. Overall, by challenging frontier AI agents and models with high\-fidelity failure scenarios, SREGym provides a foundation for advancing agentic SRE technologies toward production readiness\.

SREGym is actively maintained as an open\-source project at [https://github\.com/SREGym/SREGym](https://github.com/SREGym/SREGym)\. It has been used by researchers and practitioners\.

## 2 SREGym

Building a high\-fidelity SRE benchmark is challenging\. The benchmark must create a realistic operational environment and simulate sophisticated failure scenarios\. We enforce the following design principles in developing SREGym \(Appendix [B](https://arxiv.org/html/2605.07161#A2) documents our engineering practices\):

- Creating noisy and eventful environments\. A clean, quiet environment is not only unrealistic but also unlikely to challenge AI\. However, few existing benchmarks model noises\. SREGym offers controlled noises, without compromising reliable evaluation\.
- Simulating faults, not symptoms\. We reject a common practice of existing benchmarks that use chaos engineering tools to create failure symptoms, which can only be mitigated by stopping the tools\. Instead, we focus on simulating fine\-grained faults\.
- Composability is key to scaling problems\. SREGym achieves composability with support for orchestrating faults and noises into different failure scenarios\. Prior work used Ansible scripts, which are hard to extend and ill\-suited to failure modes that require distributed events\.
- Usability and extensibility are essential\. Usability and extensibility may not reflect academic novelty, but they are critical to the success of a benchmark\. SREGym is push\-button and offers APIs for problem extension \(which enables its users to contribute new problems\)\.

Figure [2](https://arxiv.org/html/2605.07161#S2.F2) provides an overview of SREGym in terms of its components\. We present the main components in the remainder of this section\.

![Refer to caption](https://arxiv.org/html/2605.07161v1/x3.png)

Figure 2: Overview of the SREGym framework and benchmark suites\.

### 2\.1 Problem Definition: A User Perspective

SREGym currently contains 90 SRE problems\. A problem in SREGym creates a failure scenario in a live production\-like environment that an SRE agent must diagnose and mitigate\. Formally, a problem is a four\-tuple $P=(\mathcal{E},\mathcal{I},\mathcal{F},\mathcal{O})$, consisting of:

- System environment $\mathcal{E}$\. SREGym exposes a production\-like environment that deploys user applications, backend systems, and control/management services \(e\.g\., Kubernetes controllers\)\.
- Agent interface $\mathcal{I}$\. The exposed tools and APIs for agents to observe and interact with $\mathcal{E}$\.
- Faults and noises $\mathcal{F}$\. A set $\mathcal{F}=\{f_1,\ldots,f_k\}$ of one or more faults, each targeting a specific component in $\mathcal{E}$\. A subset of $\mathcal{F}$ may be designated as noises: transient, low\-impact disturbances that an agent must distinguish from the root cause\(s\) of the target failure\.
- Oracles $\mathcal{O}=(\mathcal{O}_d,\mathcal{O}_m)$, where $\mathcal{O}_d$ is a diagnosis oracle and $\mathcal{O}_m$ is a mitigation oracle\.

For each problem, the inputs and expected outputs of evaluated agents are described as follows\.

- Inputs\. The agent can query the system environment $\mathcal{E}$ \(with $\mathcal{F}$ injected\) through standard observability modalities via the agent interface $\mathcal{I}$\.
- Expected outputs\. The agent makes two submissions for a problem\. The first is a natural\-language diagnosis describing the root cause of the failure\. This submission triggers $\mathcal{O}_d$ \(see §[2\.5](https://arxiv.org/html/2605.07161#S2.SS5)\)\. The second submission signals that the agent has completed its mitigation effort and triggers $\mathcal{O}_m$, which is based on actual system and application states\. Following best practices for agent evaluation\[[23](https://arxiv.org/html/2605.07161#bib.bib4),[94](https://arxiv.org/html/2605.07161#bib.bib3)\], SREGym uses programmatic verification whenever possible for reliable evaluation\.
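
The four-tuple above can be pictured as a small data structure. The following is a minimal sketch for illustration only: `ProblemSpec`, `Fault`, and every field name are hypothetical and are not SREGym's actual classes, and the two oracles are reduced to plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Fault:
    """One element f_i of F: a fault (or noise) aimed at a component of E."""
    name: str
    target: str
    is_noise: bool = False  # noises are a designated subset of F


@dataclass
class ProblemSpec:
    """Hypothetical container mirroring P = (E, I, F, O)."""
    environment: str                         # E: label for the deployed stack
    interface: List[str]                     # I: tools exposed to the agent
    faults: List[Fault]                      # F: faults and noises
    diagnosis_oracle: Callable[[str], bool]  # O_d: judges the diagnosis text
    mitigation_oracle: Callable[[], bool]    # O_m: checks system state

    def root_causes(self) -> List[Fault]:
        """Faults the agent is expected to identify (everything but noises)."""
        return [f for f in self.faults if not f.is_noise]


# Illustrative instance; the toy oracles stand in for the real ones.
spec = ProblemSpec(
    environment="social-network",
    interface=["metrics", "logs", "traces", "cluster-control", "submission"],
    faults=[
        Fault("port-misconfig", "user-service"),
        Fault("pod-restart", "unrelated-pod", is_noise=True),
    ],
    diagnosis_oracle=lambda text: "port" in text.lower(),
    mitigation_oracle=lambda: True,
)
```

Keeping noises inside the same fault set, flagged rather than stored separately, follows the definition above, where noises are a subset of $\mathcal{F}$.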

### 2\.2 System Environments

SREGym exposes a cloud\-native, Kubernetes\-based system environment, where applications are deployed in Docker containers\. An SRE problem selects which application\(s\) to deploy based on the failure scenarios \(§[2\.4](https://arxiv.org/html/2605.07161#S2.SS4)\)\. SREGym ships a catalog of cloud\-native applications, including DeathStarBench\[[27](https://arxiv.org/html/2605.07161#bib.bib43)\], Train Ticket\[[93](https://arxiv.org/html/2605.07161#bib.bib119)\], Astronomy Shop\[[56](https://arxiv.org/html/2605.07161#bib.bib75)\], and in\-house applications \(a satellite orbit simulator and a flight booking service\), paired with backend data systems \(e\.g\., MongoDB, TiDB, Kafka, and MySQL\)\. Each application has corresponding workloads \(e\.g\., user traffic\)\. The applications and backend systems are managed by Kubernetes operators\. The applications, systems, and operators are deployed using the Helm package manager\[[34](https://arxiv.org/html/2605.07161#bib.bib47)\]\. Built atop the de facto cloud\-native stack, the environment is highly extensible: it supports any real\-world cloud\-native applications and systems with Kubernetes manifests or Helm charts, and can be deployed on any mainstream hardware platform\. For example, deploying the custom observability data\-streaming controller\[[63](https://arxiv.org/html/2605.07161#bib.bib57)\] of Resolve AI, a commercial agentic SRE product\[[64](https://arxiv.org/html/2605.07161#bib.bib56)\], takes only one kubectl command\.

### 2\.3 Agent Interface

SREGym makes no assumption on the architectures or interaction patterns of evaluated agents, to avoid brittle coupling of agent and benchmark\. \(Footnote 2: For example, AIOpsLab\[[14](https://arxiv.org/html/2605.07161#bib.bib27)\] decomposes each failure scenario into four isolated sub\-problems \(detection, localization, RCA, and mitigation\) and scores them independently, but real\-world failure handling is a single end\-to\-end loop in which earlier evidence and actions shape later steps; SREGym instead evaluates the entire failure scenario holistically\.\) The benchmark exposes Model Context Protocol \(MCP\) servers for agents, allowing them to observe and interact with the target systems and their environment\. SREGym provides the following interfaces as MCP servers:

- Metrics for querying time\-series performance metrics through Prometheus\[[55](https://arxiv.org/html/2605.07161#bib.bib48)\]\.
- Logs for searching and filtering container logs through Loki\[[50](https://arxiv.org/html/2605.07161#bib.bib49)\]\.
- Traces for inspecting request traces between system components via Jaeger\[[40](https://arxiv.org/html/2605.07161#bib.bib50)\]\.
- Cluster control for executing commands to observe and change the system states\. We currently support kubectl commands \(Kubernetes’ command\-line interface\)\.
- Submission for submitting diagnosis and mitigation results, which triggers evaluation\.

For power users who want to design customized tools, SREGym also exposes the raw observability endpoints and Kubernetes API endpoints for their agents to connect to\.

```python
# Problem, SocialNetwork, LLMAsAJudgeOracle, MitigationOracle,
# mark_fault_injected, and NetworkPortFaultInjector come from the SREGym
# framework; the figure does not show their import paths.
class K8sNetworkPortMisconfig(Problem):
    def __init__(self):
        # Application: the target application and its workload
        self.app = SocialNetwork()
        self.app.create_workload()

        # Oracle: ground-truth root cause plus diagnosis and mitigation oracles
        self.root_cause = ("The user-service has a misconfigured network port [...]")
        self.diagnosis_oracle = LLMAsAJudgeOracle(problem=self, expected=self.root_cause)
        self.mitigation_oracle = MitigationOracle(problem=self)

    # Fault: the fault injected into the user-service microservice
    @mark_fault_injected
    def inject_fault(self):
        injector = NetworkPortFaultInjector(namespace=self.namespace)
        injector._inject(
            fault_type="port-misconfig",
            microservice="user-service",
        )
```

Figure 3: Implementing a problem in SREGym \(noises are injected by the framework\)\.

Table 1: Fault and noise injectors in SREGym \(“K8s” refers to Kubernetes\)\.

| Mechanism | Simulated Faults |
| --- | --- |
| Kill a process or a pod | Fail\-stop behavior |
| Stress hardware\[[68](https://arxiv.org/html/2605.07161#bib.bib67)\] | Fail\-slow behavior\[[32](https://arxiv.org/html/2605.07161#bib.bib35)\] |
| Fail syscall via eBPF | OS/hardware faults\[[16](https://arxiv.org/html/2605.07161#bib.bib107),[57](https://arxiv.org/html/2605.07161#bib.bib108)\] |
| Corrupt a disk sector\[[25](https://arxiv.org/html/2605.07161#bib.bib106)\] | Disk sector errors\[[65](https://arxiv.org/html/2605.07161#bib.bib33),[8](https://arxiv.org/html/2605.07161#bib.bib36)\] |
| Fault in deploy\.yaml | Service mis\-deployment |
| Fault in app\. config | App\. misconfiguration\[[81](https://arxiv.org/html/2605.07161#bib.bib19)\] |
| Fault in K8s config | K8s misconfiguration\[[90](https://arxiv.org/html/2605.07161#bib.bib71)\] |
| Use buggy app\. code | Code bugs |
| Use buggy app\. operator | Misoperations\[[28](https://arxiv.org/html/2605.07161#bib.bib26),[29](https://arxiv.org/html/2605.07161#bib.bib31)\] |
| Increase client loads | Service overloads\[[54](https://arxiv.org/html/2605.07161#bib.bib18)\] |
| Pause/restart unrelated pods | Temporal pod failures |
| Inject latency / drop packets | Network delay / jitter |
| Stress resource of nodes | Noisy neighbors\[[48](https://arxiv.org/html/2605.07161#bib.bib16),[58](https://arxiv.org/html/2605.07161#bib.bib17)\] |

### 2\.4 Creating SRE Problems by Composing Faults

SREGym offers 90 problems \(with new ones being continuously added\)\. The problems can be easily mutated, e\.g\., with different noise patterns\. SREGym currently provides 50 fault primitives that can be applied to 139 deployable services across 5 supported applications\. With compatibility constraints \(certain faults can only be applied to specific services\), SREGym offers 3,623 viable fault\-component pairs even before composing noises or multiple faults, a roughly 40× multiplier over the 90 curated scenarios \(see Appendix [C](https://arxiv.org/html/2605.07161#A3)\)\. This is a structural difference from benchmarks that ship as a fixed set of hardcoded scenarios: the curated 90 problems reflect what we have validated end\-to\-end, not the extent of what SREGym can express\. Figure [3](https://arxiv.org/html/2605.07161#S2.F3) shows the implementation of one SREGym problem\.
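
The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the unconstrained product of primitives and services is computed here as an upper bound and is not a figure reported in the paper.

```python
fault_primitives = 50
deployable_services = 139
viable_pairs = 3623       # pairs remaining after compatibility constraints
curated_problems = 90

# Upper bound if every fault primitive could be applied to every service.
unconstrained_pairs = fault_primitives * deployable_services

# Share of the unconstrained space that survives the constraints.
viable_fraction = viable_pairs / unconstrained_pairs

# Ratio behind the "roughly 40x" statement.
multiplier = viable_pairs / curated_problems

print(unconstrained_pairs, round(viable_fraction, 2), round(multiplier, 1))
```

Under these numbers, about half of the unconstrained 6,950 pairs are viable, and the ratio to the 90 curated problems is about 40.3.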

Fault and Noise Simulation\. Table [1](https://arxiv.org/html/2605.07161#S2.T1) shows the faults and noises SREGym simulates and the simulation mechanisms\. The faults include common ones like application misconfigurations and new ones SREGym implements \(e\.g\., a tool that uses eBPF to simulate OS and hardware faults\)\. More fault injectors can be directly integrated in SREGym\. These faults are at different layers in the system stack and manifest via various symptoms\. They require SRE agents to have comprehensive knowledge and understanding of different system components and their interactions\. For example, faults injected into microservices require SRE agents to understand application logic and their dependencies\[[74](https://arxiv.org/html/2605.07161#bib.bib80),[75](https://arxiv.org/html/2605.07161#bib.bib82)\]\. Faults injected into Kubernetes controllers and operators require SRE agents to understand how Kubernetes manages the cluster and applications\[[29](https://arxiv.org/html/2605.07161#bib.bib31),[70](https://arxiv.org/html/2605.07161#bib.bib32)\]\. Lower\-level OS and hardware faults require SRE agents to understand vertical interactions of applications, OSes, and hardware\[[31](https://arxiv.org/html/2605.07161#bib.bib39)\]\. Each injector exposes a Python API; composing a compound failure reduces to invoking several injectors\.
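
To illustrate what "invoking several injectors" could look like, here is a minimal sketch with stand-in classes. `RecordingInjector` and `CompoundFailure` are hypothetical names invented for this example, the injectors only record the action they would take, and the fault-type strings are illustrative; none of this is SREGym's real injector API.

```python
from typing import List


class RecordingInjector:
    """Stand-in injector: records what it would do instead of touching a cluster."""

    def __init__(self, fault_type: str, target: str) -> None:
        self.fault_type = fault_type
        self.target = target

    def inject(self) -> str:
        return f"inject {self.fault_type} -> {self.target}"

    def recover(self) -> str:
        return f"recover {self.fault_type} -> {self.target}"


class CompoundFailure:
    """Composes several injectors into one failure scenario."""

    def __init__(self, injectors: List[RecordingInjector]) -> None:
        self.injectors = list(injectors)

    def inject(self) -> List[str]:
        # Apply the faults in the order they were listed.
        return [injector.inject() for injector in self.injectors]

    def recover(self) -> List[str]:
        # Undo in reverse order so later faults are removed first.
        return [injector.recover() for injector in reversed(self.injectors)]


concurrent = CompoundFailure([
    RecordingInjector("scheduler-misconfig", "observability-service"),
    RecordingInjector("port-misconfig", "user-service"),
])
actions = concurrent.inject()
```

Recovering in reverse order is a design choice of this sketch, not something the paper specifies.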

We define noises as transient, self\-recovering disturbances \(e\.g\., a pod crashing and then being rescheduled, or a few dropped requests\); they are not failure root causes\. SREGym injects these transient events on a configurable schedule\. The agents see symptoms from both offending faults and noises and must distinguish the two\.

![Refer to caption](https://arxiv.org/html/2605.07161v1/x4.png)

\(a\) Metastable failures

![Refer to caption](https://arxiv.org/html/2605.07161v1/x5.png)

\(b\) Concurrent failures

Figure 4: SRE problems that create scenarios with different failure modes in SREGym\.

Failure Modes\. SREGym enables developers to compose fault and noise injectors to create failure scenarios in different modes\. Our current problem set covers three important failure modes\.

- Metastable failures are self\-sustaining congestive collapses in which the system degrades in response to transient events \(e\.g\., a load surge\) but fails to recover after the trigger is removed\[[10](https://arxiv.org/html/2605.07161#bib.bib24),[36](https://arxiv.org/html/2605.07161#bib.bib6),[38](https://arxiv.org/html/2605.07161#bib.bib25)\]\. Metastable failures are known to be hard to diagnose, as they do not manifest in crashing behavior\. Figure [4\(a\)](https://arxiv.org/html/2605.07161#S2.F4.sf1) shows a metastable failure problem we created, which \(1\) sets overly aggressive gRPC configurations for connection timeout \(50ms\) and retry count \(30\), \(2\) pushes the application \(Hotel Reservation in DeathStarBench\[[27](https://arxiv.org/html/2605.07161#bib.bib43)\]\) into a vulnerable state with a high load of 3000 requests per second, and \(3\) triggers the metastable failure with a transient CPU stress\. As a result, all RPC requests time out and are retried at the same time, causing a retry storm\. SREGym’s ability to coordinate different events is crucial to creating this problem\.
- Concurrent failures are compound failures caused by multiple independent failures occurring simultaneously\. Figure [4\(b\)](https://arxiv.org/html/2605.07161#S2.F4.sf2) depicts a problem where two faults are injected into the application \(Social Network in DeathStarBench\[[27](https://arxiv.org/html/2605.07161#bib.bib43)\]\): \(1\) a scheduler misconfiguration that makes an observability service pod unschedulable, and \(2\) a network port misconfiguration that fails the UserID service, which in turn fails user requests\. The problem evaluates whether SRE agents can understand and prioritize the port misconfiguration, as it directly affects service availability \(the scheduler misconfiguration only affects the observability service, which is invisible to end users\)\.
- Correlated failures are failures in which multiple components fail at the same time because they share a common cause or dependency\[[84](https://arxiv.org/html/2605.07161#bib.bib9),[26](https://arxiv.org/html/2605.07161#bib.bib11)\]\. The agent must recognize the correlations between the symptoms and connect them to the underlying root cause\(s\)\.
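
The retry storm in the metastable example can be sized with back-of-the-envelope arithmetic. The sketch below uses the three parameters quoted above (50 ms timeout, retry count 30, 3000 requests per second) and rests on two assumptions that are not stated in the paper: that a retry count of 30 means up to 30 retries after the initial attempt, and that retries are issued immediately without backoff.

```python
offered_load_rps = 3000   # client load from the scenario
timeout_s = 0.050         # gRPC connection timeout (50 ms)
max_retries = 30          # gRPC retry count

# Assumption: one initial attempt plus up to max_retries retries.
attempts_per_request = 1 + max_retries

# Worst case once every attempt times out: each client request turns into
# attempts_per_request attempts hitting the backend.
worst_case_attempt_rate = offered_load_rps * attempts_per_request

# Time one request keeps generating attempts before giving up,
# assuming no backoff between retries.
retry_window_s = attempts_per_request * timeout_s

print(worst_case_attempt_rate, round(retry_window_s, 2))
```

Under these assumptions the backend can see up to 93,000 attempts per second from a 3,000 requests-per-second workload, which suggests why the overload can sustain itself after the transient CPU stress ends.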

### 2\.5 Evaluation Oracles

Designing oracles that fairly evaluate SRE agents is challenging\. Existing SRE benchmarks use oracles that evaluate diagnosis results by requiring SRE agents to strictly match predefined labels \(e\.g\., names of offending services and root causes\)\[[14](https://arxiv.org/html/2605.07161#bib.bib27),[77](https://arxiv.org/html/2605.07161#bib.bib120)\]\. Such oracles are brittle because predefined labels are often ambiguous and not mutually exclusive; in our experience, agents often report correct results but are marked incorrect because they do not match the labels exactly\. ITBench\[[41](https://arxiv.org/html/2605.07161#bib.bib28)\] proposed Normalized Topology\-Aware Matching, which requires fine\-grained annotations of failure\-propagation graphs\. However, we find that such laborious annotation is error\-prone and hard to scale\.

Diagnosis Oracle\. SREGym adopts a checklist\-based LLM\-as\-a\-judge protocol\[[91](https://arxiv.org/html/2605.07161#bib.bib73)\] that decomposes diagnosis evaluation into fine\-grained questions in a structured form and produces stable verdicts across evaluators\. We follow the principles of decomposing evaluation into multi\-dimensional questions to improve inter\-evaluator agreement\[[45](https://arxiv.org/html/2605.07161#bib.bib68)\] and of grounding the rubric in domain\-expert insight instead of LLM\-generated criteria\[[18](https://arxiv.org/html/2605.07161#bib.bib69),[72](https://arxiv.org/html/2605.07161#bib.bib70)\]\. Given a ground\-truth root\-cause description $g$ and an agent\-submitted diagnosis $d$, the diagnosis oracle assembles a prompt containing $g$, $d$, and a checklist of $N=9$ Yes/No questions organized into $K=3$ dimensions\. An LLM evaluator returns, for each question $q$, a judgment $y_q\in\{0,1\}$ together with supporting evidence and a confidence\. For dimension $k$ with a set of questions $Q_k$, the per\-dimension score $s_k$ is the fraction of affirmative answers, $s_k=\frac{1}{|Q_k|}\sum_{q\in Q_k}y_q$, and the aggregated score is a weighted sum $S=\sum_{k=1}^{K}w_k\,s_k$ \(we give all dimensions equal weight, i\.e\., $w_k=\frac{1}{3}$\)\. The final verdict is $\hat{v}=\mathbb{1}[S\geq\tau]$ with default threshold $\tau=\frac{7}{9}$ \(which prevents a submission from passing while missing an entire dimension\)\.

We find that decomposition into questions yields transparent, per\-dimension evaluation of *where* the agent’s understanding fell short\. Our three dimensions are:

- Fault localization: Does the diagnosis identify the correct originating component, distinguishing it from downstream victims of cascading failures?
- Fault characterization: Does the diagnosis capture the faults and concrete details \(e\.g\., the wrong port value, a specific environment variable\)?
- Failure scope: Does the diagnosis avoid over\- or under\-attributing impact in terms of the components and symptoms?
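
The scoring rule of the diagnosis oracle is easy to work through by hand. The sketch below implements $s_k$, $S$, and the verdict with exact fractions; it assumes three questions per dimension (the paper only states nine questions across three dimensions), and the dictionary keys and function names are illustrative rather than taken from SREGym.

```python
from fractions import Fraction
from typing import Dict, List

TAU = Fraction(7, 9)  # default threshold from the text


def aggregate_score(answers: Dict[str, List[int]]) -> Fraction:
    """S = sum_k w_k * s_k with equal weights w_k = 1/K."""
    weight = Fraction(1, len(answers))
    total = Fraction(0)
    for judgments in answers.values():
        s_k = Fraction(sum(judgments), len(judgments))  # share of "Yes"
        total += weight * s_k
    return total


def verdict(answers: Dict[str, List[int]], tau: Fraction = TAU) -> bool:
    """Pass if and only if S >= tau."""
    return aggregate_score(answers) >= tau


# Two "No" answers in total: S = (1 + 2/3 + 2/3) / 3 = 7/9, which passes.
borderline = {
    "localization": [1, 1, 1],
    "characterization": [1, 1, 0],
    "scope": [1, 0, 1],
}

# One dimension missed entirely: S = (0 + 1 + 1) / 3 = 2/3, which fails.
missing_dimension = {
    "localization": [0, 0, 0],
    "characterization": [1, 1, 1],
    "scope": [1, 1, 1],
}

print(aggregate_score(borderline), verdict(borderline))
print(aggregate_score(missing_dimension), verdict(missing_dimension))
```

Exact fractions avoid floating-point error at the boundary $S=\tau$. With three questions per dimension, the threshold tolerates at most two negative answers, and a fully missed dimension caps $S$ at $\frac{2}{3}<\frac{7}{9}$.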

Table 2: Pairwise inter\-evaluator agreement\.

| Judge A | Judge B | Agreement | $\kappa$ |
| --- | --- | --- | --- |
| Sonnet\-4\.6 | Human | 0\.95 | 0\.90 |
| Sonnet\-4\.6 | GPT\-5\.4 | 0\.88 | 0\.76 |
| Sonnet\-4\.6 | K2\.5 | 0\.91 | 0\.82 |
| GPT\-5\.4 | K2\.5 | 0\.97 | 0\.94 |
| GPT\-5\.4 | Human | 0\.85 | 0\.70 |
| K2\.5 | Human | 0\.88 | 0\.76 |

Table [2](https://arxiv.org/html/2605.07161#S2.T2) validates the oracle’s verdicts on a stratified sample of 100 agent diagnosis results, independently labelled by a domain expert and scored by two alternate LLM evaluators \(GPT\-5\.4 and Kimi K2\.5\)\. The default oracle agrees with human experts at Cohen’s $\kappa=0.90$ \(almost perfect agreement\), while the two alternate LLM evaluators converge at $\kappa=0.94$ and retain substantial agreement with human experts \($\kappa=0.70$ and $\kappa=0.76$ for GPT\-5\.4 and Kimi K2\.5, respectively\)\. The convergence between LLM evaluators suggests that the decision depends primarily on checklist decomposition, not the choice of LLMs \(a concern with single\-model LLM\-as\-a\-judge evaluation\[[91](https://arxiv.org/html/2605.07161#bib.bib73)\]\)\.
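
For readers unfamiliar with Cohen's $\kappa$, the following sketch computes it from two lists of binary verdicts as $\kappa=(p_o-p_e)/(1-p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The labels are synthetic and were constructed for this example; they are not the paper's evaluation data.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always give the same single label
    return (p_o - p_e) / (1 - p_e)


# Synthetic verdicts for 100 items: rater A is balanced (50 pass, 50 fail),
# and rater B disagrees with A on exactly five items.
rater_a = [1] * 50 + [0] * 50
rater_b = [0] * 5 + [1] * 45 + [0] * 50

kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

In this synthetic case $p_o=0.95$ and $p_e=0.5$, giving $\kappa=0.90$. Every pair reported in Table 2 likewise satisfies $\kappa=2\cdot\text{Agreement}-1$ to two decimals, which is what the formula yields when chance agreement is 0.5; this is an observation from the reported numbers, not a statement made by the authors.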

Mitigation Oracle\. The mitigation oracle is problem\-specific to accurately reflect whether the target failure is truly mitigated\. The oracle checks whether the target fault is resolved and whether the target system has recovered to a healthy state\. The mitigation oracle uses both client\-side observability, such as user request success rate, and system\-side observability of application processes, the Kubernetes cluster, etc\. This allows us to evaluate the agent’s mitigation results on complicated failure scenarios, such as metastable failures and low\-level software/hardware failures\.

## 3 Results

We use SREGym to evaluate three AI agents powered by frontier LLMs: Stratus \[[13](https://arxiv.org/html/2605.07161#bib.bib54)\], a state\-of\-the\-art SRE agent, run with two LLMs \(Claude Sonnet\-4\.6 and Kimi K2\.5\); and two coding agents, Claude Code \[[20](https://arxiv.org/html/2605.07161#bib.bib84)\] with Claude Sonnet\-4\.6 and Codex \[[21](https://arxiv.org/html/2605.07161#bib.bib88)\] with GPT\-5\.4 \(gpt\-5\.4\-2026\-03\-05\)\. Each of the four agent\-model pairs is evaluated with three runs per problem\. We use Claude Sonnet\-4\.6 for the diagnosis oracle \(see §[2\.5](https://arxiv.org/html/2605.07161#S2.SS5)\) consistently across the evaluation\.

For noise simulation, we randomly simulate two noise patterns every five minutes, each lasting two minutes\. The noise simulation is performed by the framework, not hardcoded in any problem\.
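The noise cadence described above can be sketched as a simple schedule generator. This is only an illustration under stated assumptions: the names `NoiseWindow` and `build_noise_schedule` and the pattern names are hypothetical, both noises are assumed to start at the beginning of each five-minute period (the text does not say when within the period they start), and the horizon is taken to be the 1800-second agent timeout.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseWindow:
    pattern: str
    start_s: int
    end_s: int


def build_noise_schedule(patterns, horizon_s=1800, period_s=300,
                         duration_s=120, per_period=2, seed=0):
    """Pick `per_period` distinct patterns at the start of every period."""
    rng = random.Random(seed)  # seeded so a schedule can be replayed
    schedule = []
    for start in range(0, horizon_s, period_s):
        for pattern in rng.sample(patterns, per_period):
            end = min(start + duration_s, horizon_s)
            schedule.append(NoiseWindow(pattern, start, end))
    return schedule


# Hypothetical pattern names, not taken from the paper.
patterns = ["cpu_spike", "pod_restart", "log_burst", "network_jitter"]
schedule = build_noise_schedule(patterns)
for window in schedule[:4]:
    print(window)
```

Keeping the schedule outside the problem definitions mirrors the point made in the text: noise is a framework concern, so any problem can be run with or without it.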

Evaluation Metrics\. We evaluate the agents on the success rates of diagnosis and mitigation tasks\. Diagnosis success rate measures whether the agent correctly pinpoints the root causes of target failures\. Mitigation success rate measures whether the agent successfully mitigates failures \(verified by the mitigation oracle\)\. End\-to\-end \(E2E\) success rate counts the runs in which the agent achieves both correct diagnosis and correct mitigation\. We also report Time\-To\-Diagnose \(TTD\), Time\-To\-Mitigate \(TTM\), and mean token usage per problem run\.
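As a rough sketch of how these metrics fit together, the snippet below aggregates per-run records into the three success rates and into mean TTD and TTM, with timed-out runs contributing the 1800-second agent timeout. The `Run` record and `summarize` function are hypothetical names, and the sample runs are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_S = 1800.0  # agent timeout used as the cap for TTD and TTM


@dataclass
class Run:
    diagnosed: bool          # diagnosis oracle verdict
    mitigated: bool          # mitigation oracle verdict
    ttd_s: Optional[float]   # None if the run timed out before diagnosing
    ttm_s: Optional[float]   # None if the run timed out before mitigating


def capped(t: Optional[float]) -> float:
    """Timed-out or over-long runs contribute the cap value."""
    return TIMEOUT_S if t is None else min(t, TIMEOUT_S)


def summarize(runs):
    n = len(runs)
    return {
        "diag": sum(r.diagnosed for r in runs) / n,
        "mitig": sum(r.mitigated for r in runs) / n,
        # E2E requires both verdicts to hold on the same run.
        "e2e": sum(r.diagnosed and r.mitigated for r in runs) / n,
        "ttd": sum(capped(r.ttd_s) for r in runs) / n,
        "ttm": sum(capped(r.ttm_s) for r in runs) / n,
    }


# Made-up runs for illustration only.
runs = [
    Run(True, True, 100.0, 400.0),
    Run(True, False, 200.0, None),
    Run(False, True, None, 2000.0),
    Run(False, False, None, None),
]
print(summarize(runs))
```

Because E2E is computed per run, it can be lower than both the diagnosis and mitigation rates: here two runs diagnose and two mitigate, but only one does both.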

### 3\.1 Overall Benchmarking Results

Table 3: Overall benchmark results on SREGym\. TTD and TTM are capped at the 1800\-second agent timeout, with timed\-out runs contributing the cap value, so failure cost is reflected in the mean rather than inducing survivorship bias\. ■: runs with noise injected; □: runs without noise injected\.

| Agent | Model | Noise | Diag\. \(%\) ↑ | Mitig\. \(%\) ↑ | E2E \(%\) ↑ | TTD \(s\) ↓ | TTM \(s\) ↓ | \# Tokens ↓ |
|---|---|---|---|---|---|---|---|---|
| Stratus | Sonnet\-4\.6 | □ | 61\.5% | 78\.5% | 54\.8% | 114\.0 | 771\.1 | 812K |
| Stratus | Sonnet\-4\.6 | ■ | 51\.5% | 65\.5% | 40\.2% | 170\.5 | 885\.0 | 464K |
| Stratus | K2\.5 | □ | 41\.3% | 60\.6% | 32\.9% | 674\.5 | 1348\.8 | 413K |
| Stratus | K2\.5 | ■ | 38\.9% | 57\.3% | 30\.4% | 656\.4 | 1283\.2 | 443K |
| Claude Code | Sonnet\-4\.6 | □ | 72\.6% | 75\.6% | 60\.7% | 292\.5 | 702\.0 | 1\.47M |
| Claude Code | Sonnet\-4\.6 | ■ | 62\.6% | 76\.3% | 53\.7% | 314\.0 | 736\.5 | 1\.71M |
| Codex | GPT\-5\.4 | □ | 70\.0% | 65\.2% | 53\.3% | 176\.4 | 376\.0 | 1\.98M |
| Codex | GPT\-5\.4 | ■ | 59\.3% | 64\.0% | 45\.9% | 218\.1 | 397\.7 | 1\.88M |

Table [3](https://arxiv.org/html/2605.07161#S3.T3) shows the overall results\. SREGym presents challenges to frontier AI agents and models, with diagnosis success rates ranging from 38\.9% to 72\.6% and mitigation success rates ranging from 57\.3% to 78\.5% across agent\-model pairs\. The SRE problems evaluate an agent’s ability to reason about complex interactions between system components, infer root causes from symptoms, and effectively use tools to operate systems, all of which remain difficult for frontier agents and models\.

We find that agents show different characteristics in addressing SRE problems\. Claude Code achieves the highest end\-to\-end success rates, ahead of Stratus and Codex\. Stratus with Sonnet\-4\.6 has the highest mitigation success rate among all agents when no noise is injected \(78\.5%\), attributed to its undo\-and\-retry mechanism \[[13](https://arxiv.org/html/2605.07161#bib.bib54)\]\. On the other hand, Stratus with Kimi K2\.5 has the lowest success rates, due to limited raw model capabilities\.

Token Cost\. Claude Code uses 3× more tokens per run than Stratus, and Codex uses 3\.6× more\. The reason is that coding agents are not optimized for processing the large volumes of observability data in SRE tasks\. Stratus, as an SRE agent, preprocesses observability data and only prompts LLMs with relevant data\.
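The text does not spell out how the 3× and 3.6× ratios are derived. One reading that reproduces them, offered here as an assumption, is to average the mean token counts in Table 3 over each agent's configurations (all four Stratus rows, and both rows for each coding agent) and divide by the Stratus average:

```python
# Mean tokens per run in thousands, copied from Table 3.
tokens_k = {
    "Stratus": [812, 464, 413, 443],   # Sonnet-4.6 and K2.5, without/with noise
    "Claude Code": [1470, 1710],       # without/with noise
    "Codex": [1980, 1880],             # without/with noise
}


def mean(values):
    return sum(values) / len(values)


# Assumption: the ratios compare per-agent averages over all configurations.
avg = {agent: mean(values) for agent, values in tokens_k.items()}
ratio = {
    agent: avg[agent] / avg["Stratus"]
    for agent in ("Claude Code", "Codex")
}
for agent, value in ratio.items():
    print(f"{agent}: {value:.2f}x the Stratus average")
```

Under this reading the ratios come out at roughly 2.98 and 3.62. Comparing individual rows instead gives a wider spread, since Stratus with Sonnet\-4.6 alone ranges from 464K to 812K tokens.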

Impact of Noises\. As shown in Table [3](https://arxiv.org/html/2605.07161#S3.T3), diagnosis success rate drops for every agent\-model pair when noise is injected\. Mitigation success is more robust than diagnosis under noises\. Noises distract an agent from the correct hypothesis about the root causes; on the other hand, agents tend to recover from incorrect hypotheses with self\-validation \(see §[3\.3](https://arxiv.org/html/2605.07161#S3.SS3)\)\. Inspecting trajectories, we observe that all the evaluated agents take a greedy approach \(see Appendix [D](https://arxiv.org/html/2605.07161#A4)\): they always treat the first plausible anomaly as the target failure \(which can be a noise\)\. In several cases, the agent did find evidence of the root cause of the target failure, but disregarded it because it was irrelevant to the noise the agent was \(wrongly\) targeting\.

### 3\.2 Results on New Failure Scenarios

Table [4](https://arxiv.org/html/2605.07161#S3.T4) partitions SREGym’s problems by failure scenarios\. We find that problems ported from existing benchmarks \[[14](https://arxiv.org/html/2605.07161#bib.bib27),[41](https://arxiv.org/html/2605.07161#bib.bib28)\] introduce limited challenges for the evaluated agents with strong models\. For example, the mitigation success rate is above 80% for Stratus with Sonnet\-4\.6 \(83\.3% without noise\)\. Most of these problems are single\-fault scenarios focusing on applications and do not include noises\. SREGym includes new problems that are built on similar failure scenarios in terms of fault families, but differ in applications and system components\. As shown in Table [4](https://arxiv.org/html/2605.07161#S3.T4), the evaluated agents show similar results on these problems\.

However, Table [4](https://arxiv.org/html/2605.07161#S3.T4) shows that the agents perform significantly worse on new failure scenarios that are unique to SREGym\. Without noise, the end\-to\-end success rates of Stratus with Sonnet\-4\.6, Claude Code, and Codex decrease from 63\.7% to 17\.9%, 60\.8% to 28\.2%, and 57\.8% to 15\.4%, respectively, between ported problems and new failures\. These results show significant gaps in AI agents’ ability to address high\-fidelity failures such as those rooted in low\-level stacks and/or caused by compound failures\. We briefly describe two case studies, with more details in Appendix [D](https://arxiv.org/html/2605.07161#A4)\.

Hardware faults\. SREGym’s `latent_sector_error` problem injects intermittent errors on `read` system calls on a node\. Across three runs of Stratus and Claude Code, no run produced an aggregated diagnosis score above 0\.22, and fault characterization received a score of 0 in every run\. All agents attributed the errors to error\-handling logic inside the application\. None of the runs proposed disk\-level diagnostics \(e\.g\., reading `dmesg` and inspecting `smartctl` output\)\.
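To illustrate what a disk-level check of the kind mentioned above might look like, here is a minimal sketch that scans kernel log text for block-layer I/O errors. The function name `find_disk_errors` and the sample log lines are assumptions: the lines imitate a common Linux kernel message format and are not taken from the paper, real messages vary across kernel versions, and the text does not state how the injected fault surfaces in the kernel log.

```python
import re

# Matches block-layer messages such as "I/O error, dev sdb, sector 204800".
IO_ERROR = re.compile(r"I/O error, dev (?P<dev>\S+), sector (?P<sector>\d+)")


def find_disk_errors(kernel_log: str):
    """Return (device, sector) pairs for block I/O errors in kernel log text."""
    return [(m["dev"], int(m["sector"])) for m in IO_ERROR.finditer(kernel_log)]


# Illustrative log excerpt (hypothetical, not from the paper).
sample = """\
[ 1021.4] blk_update_request: I/O error, dev sdb, sector 204800
[ 1021.5] Buffer I/O error on dev sdb1, logical block 25344, async page read
[ 1030.1] eth0: link up
"""
print(find_disk_errors(sample))
```

In practice the log text would come from running `dmesg` on the affected node, and a hit would prompt a follow-up with `smartctl` on the named device, rather than a search through the application's error-handling code.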

Metastable failures\. The evaluated agents reliably diagnose the application\-level trigger, which is visible in distributed traces and deployment configurations; however, no agent across the metastable failure problems identified both interacting components\. In one run, Codex did locate the resource constraint but then dismissed the application trigger as a downstream artifact, producing a diagnosis that was correct on localization but wrong on scope \(§[2\.5](https://arxiv.org/html/2605.07161#S2.SS5)\)\.

In general, we observe that agents lack a coherent, comprehensive understanding of the interactions among the control/management plane of the cluster, the deployed systems, and the user requests\. On metastable failures, agents did not connect the system\-level constraints with the application\-level trigger, which is required to identify how the system is induced into the metastable state\. On hardware failures, agents also missed the hardware\-software interactions\. This lack of understanding limits the agents’ ability to diagnose complex failures in SREGym, which are arguably closer to real\-world incidents\.

Table 4: Benchmark results partitioned into three problem types\. “Ported” refers to problems directly ported from AIOpsLab/ITBench; “Similar Failures” are new problems in SREGym that share failure patterns with “Ported”; “New Failures” are faults and failure modes unique to SREGym\. ■: runs with noise injected; □: runs without noise injected\.

| Agent | Noise | Ported \($n=34$\) Diag\. | Ported Mitig\. | Ported E2E | Similar Failures \($n=43$\) Diag\. | Similar Mitig\. | Similar E2E | New Failures \($n=13$\) Diag\. | New Mitig\. | New E2E |
|---|---|---|---|---|---|---|---|---|---|---|
| Stratus \(Sonnet\-4\.6\) | □ | 70\.6% | 83\.3% | 63\.7% | 62\.8% | 80\.6% | 58\.9% | 33\.3% | 59\.0% | 17\.9% |
| Stratus \(Sonnet\-4\.6\) | ■ | 58\.8% | 68\.6% | 45\.1% | 55\.0% | 64\.3% | 44\.2% | 20\.5% | 30\.8% | 10\.3% |
| Stratus \(K2\.5\) | □ | 42\.2% | 46\.1% | 27\.5% | 46\.5% | 44\.2% | 32\.6% | 15\.4% | 12\.8% | 10\.3% |
| Stratus \(K2\.5\) | ■ | 39\.2% | 49\.0% | 27\.5% | 43\.1% | 42\.3% | 30\.0% | 17\.9% | 23\.1% | 12\.8% |
| Claude Code | □ | 74\.5% | 71\.6% | 60\.8% | 81\.4% | 79\.1% | 70\.5% | 38\.5% | 74\.4% | 28\.2% |
| Claude Code | ■ | 66\.7% | 73\.5% | 52\.0% | 62\.8% | 78\.3% | 56\.6% | 51\.3% | 76\.9% | 48\.7% |
| Codex | □ | 76\.5% | 66\.7% | 57\.8% | 78\.3% | 68\.2% | 61\.2% | 25\.6% | 41\.0% | 15\.4% |
| Codex | ■ | 66\.7% | 66\.7% | 51\.0% | 62\.8% | 68\.2% | 50\.4% | 28\.2% | 28\.2% | 17\.9% |

Table 5: Conditional mitigation \(M\) probability with diagnosis \(D\) outcome\.

| Agent | Noise | $P(\text{M} \mid \text{D})$ | $P(\text{M} \mid \neg\text{D})$ |
|---|---|---|---|
| Stratus \(Sonnet\-4\.6\) | □ | 0\.880 | 0\.588 |
| Stratus \(Sonnet\-4\.6\) | ■ | 0\.738 | 0\.506 |
| Stratus \(K2\.5\) | □ | 0\.690 | 0\.351 |
| Stratus \(K2\.5\) | ■ | 0\.708 | 0\.396 |
| Claude Code | □ | 0\.798 | 0\.500 |
| Claude Code | ■ | 0\.834 | 0\.571 |
| Codex | □ | 0\.734 | 0\.435 |
| Codex | ■ | 0\.773 | 0\.431 |

![Refer to caption](https://arxiv.org/html/2605.07161v1/x25.png)
Figure 5: The average number of tool calls per SRE problem by category\. S: Sonnet 4\.6, K: Kimi K2\.5; ■: noises\.

### 3\.3 Correlations between Diagnosis and Mitigation

When an agent succeeds at mitigation, was that success backed by a correct diagnosis, or did the agent stumble into a fix without knowing why? Table [5](https://arxiv.org/html/2605.07161#S3.T5) shows conditional probabilities of mitigation given diagnosis outcome\. Agents may pinpoint the root cause but fail in mitigation: in cases where the agent correctly diagnoses the failure, mitigation succeeds only 69%–88% of the time \($P(\text{M} \mid \text{D})$\)\. When the initial diagnosis fails to address the root cause, mitigation success rate drops by 23–34 percentage points across agents\. The drop shows that a correct diagnosis helps guide the agent in its mitigation\.

When the diagnosis is wrong, the agent may still successfully mitigate the failure by continuously observing and tuning the systems\. Mitigation succeeds 35–59% of the time across agents when the initial diagnosis is incorrect \($P(\text{M} \mid \neg\text{D})$\)\. Note that the benchmark does not inform the agent whether its diagnosis is correct or not\. We observe two common patterns: \(1\) the agent often pattern\-matches a known symptom to mitigation actions that can successfully mitigate the failure without understanding its root cause; \(2\) the agent can correct its incorrect diagnosis by observing the persistence of the failure and continuously forming new hypotheses to make further attempts\. For example, without noises injected, when the diagnosis is incorrect, Stratus with Sonnet\-4\.6 averages 3\.82 attempts during mitigation, compared with 1\.88 attempts with correct diagnosis \(see Table [12](https://arxiv.org/html/2605.07161#A5.T12)\)\.
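The two conditional probabilities discussed in this section can be estimated directly from per-run oracle verdicts. The sketch below is illustrative only: `conditional_mitigation` is a hypothetical helper, and the toy counts are made up rather than taken from the paper's runs.

```python
def conditional_mitigation(runs):
    """Estimate P(M | D) and P(M | not D).

    `runs` is a list of (diagnosed, mitigated) boolean pairs, one per run.
    """
    with_d = [m for d, m in runs if d]
    without_d = [m for d, m in runs if not d]
    p_m_given_d = sum(with_d) / len(with_d) if with_d else float("nan")
    p_m_given_not_d = (
        sum(without_d) / len(without_d) if without_d else float("nan")
    )
    return p_m_given_d, p_m_given_not_d


# Toy verdicts (hypothetical): 50 correctly diagnosed runs, 44 of them
# mitigated; 50 misdiagnosed runs, 29 of them mitigated.
runs = (
    [(True, True)] * 44 + [(True, False)] * 6
    + [(False, True)] * 29 + [(False, False)] * 21
)
p_d, p_not_d = conditional_mitigation(runs)
gap_points = (p_d - p_not_d) * 100  # difference in percentage points
print(p_d, p_not_d, round(gap_points, 1))
```

The gap between the two estimates is a difference of probabilities, so it is most naturally read in percentage points rather than as a relative change.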

### 3\.4 Tool Usage

Figure [5](https://arxiv.org/html/2605.07161#S3.F5) classifies tool calls in the agent trajectories \(Appendix [E](https://arxiv.org/html/2605.07161#A5) shows more details\)\. The majority \(60–72%\) of tool calls are read\-only `kubectl` commands\. Two commands account for around 87% of all read operations: `kubectl get` to inspect system state and `kubectl logs` to read container logs\. The primary resources inspected are Pods \[[61](https://arxiv.org/html/2605.07161#bib.bib149)\], Deployments \[[24](https://arxiv.org/html/2605.07161#bib.bib148)\], and ConfigMaps \[[22](https://arxiv.org/html/2605.07161#bib.bib147)\]\. Agents execute about 19–28 read commands before their first mitigation actions \(see Table [9](https://arxiv.org/html/2605.07161#A5.T9)\)\.

Agents differ in their mitigation actions\. Stratus favors `kubectl patch` \(39–41% of write actions\) to change system state, while Claude Code and Codex prefer `kubectl rollout` \(29–40% of write actions\), which restarts deployments\. Codex additionally relies on `kubectl run` \(19–21%\) to spin up ephemeral containers for in\-cluster network testing and `kubectl port-forward` \(11–12%\) for direct service probing, tools that Stratus and Claude Code rarely use\.
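A breakdown like the one above presupposes some way of bucketing commands in a trajectory. The following is a simplified sketch, not the paper's classifier: the verb sets are assumptions (`run` and `port-forward` are placed on the write side only because the text lists them alongside write actions), the resource names are hypothetical, and the parser assumes the verb directly follows `kubectl`.

```python
import shlex
from collections import Counter

# Assumed verb sets, for illustration only.
READ_VERBS = {"get", "logs", "describe", "top"}
WRITE_VERBS = {"patch", "rollout", "run", "port-forward"}


def classify(command: str) -> str:
    """Label a shell command as 'read', 'write', or 'other'."""
    tokens = shlex.split(command)
    # Simplification: the verb must directly follow `kubectl`
    # (global flags placed before the verb are not handled).
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return "other"
    verb = tokens[1]
    if verb in READ_VERBS:
        return "read"
    if verb in WRITE_VERBS:
        return "write"
    return "other"


# Hypothetical trajectory; names such as "frontend" and "demo" are made up.
trajectory = [
    "kubectl get pods -n demo",
    "kubectl logs deploy/frontend -n demo",
    "kubectl describe pod frontend-0 -n demo",
    "kubectl rollout restart deployment/frontend -n demo",
    "curl http://localhost:9090/api/v1/query?query=up",
]
shares = Counter(classify(c) for c in trajectory)
print({label: count / len(trajectory) for label, count in shares.items()})
```

A verb-level split is coarse: for example, `kubectl rollout status` only reads state, so a more careful classifier would also look at subcommands.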

Stratus has custom\-built API tools for querying metrics, traces, and service dependency graphs, which account for 15–17% of its tool calls\. Claude Code and Codex instead discover observability endpoints and construct raw `curl` commands; their observability access accounts for only 2–3% of tool calls\.

## 4 Related Work

The advances of AI for SWE \(Software Engineering\) have pushed AI for SRE to be the next frontier\. Recent work has made active progress on agentic SRE technologies, from root cause analysis \(RCA\) to failure mitigation \[[15](https://arxiv.org/html/2605.07161#bib.bib53),[59](https://arxiv.org/html/2605.07161#bib.bib127),[78](https://arxiv.org/html/2605.07161#bib.bib129),[87](https://arxiv.org/html/2605.07161#bib.bib130),[88](https://arxiv.org/html/2605.07161#bib.bib132),[71](https://arxiv.org/html/2605.07161#bib.bib135),[51](https://arxiv.org/html/2605.07161#bib.bib137),[85](https://arxiv.org/html/2605.07161#bib.bib143),[52](https://arxiv.org/html/2605.07161#bib.bib141),[92](https://arxiv.org/html/2605.07161#bib.bib125),[13](https://arxiv.org/html/2605.07161#bib.bib54)\]; meanwhile, many commercial and open\-source SRE agent products have been developed \[[64](https://arxiv.org/html/2605.07161#bib.bib56),[19](https://arxiv.org/html/2605.07161#bib.bib95),[73](https://arxiv.org/html/2605.07161#bib.bib123),[7](https://arxiv.org/html/2605.07161#bib.bib90),[6](https://arxiv.org/html/2605.07161#bib.bib91)\]\. Our communications with many SRE researchers and practitioners show that high\-quality SRE benchmarks are highly desired\.

A few benchmarks provide static datasets of Q&A \[[47](https://arxiv.org/html/2605.07161#bib.bib76),[67](https://arxiv.org/html/2605.07161#bib.bib110)\], observability data from real/synthetic sources \[[33](https://arxiv.org/html/2605.07161#bib.bib77),[39](https://arxiv.org/html/2605.07161#bib.bib78),[79](https://arxiv.org/html/2605.07161#bib.bib83)\], and system snapshots \[[77](https://arxiv.org/html/2605.07161#bib.bib120)\]\. SREGym can also be used to generate such datasets\. These datasets are valuable for anomaly detection and RCA, but are fundamentally limited because they do not let agents iteratively probe and observe the systems as in a real\-world diagnosis process\. For the same reason, these benchmarks cannot support mitigation tasks\.

The design of SREGym comes from reflection on existing live benchmarks \[[14](https://arxiv.org/html/2605.07161#bib.bib27),[41](https://arxiv.org/html/2605.07161#bib.bib28),[86](https://arxiv.org/html/2605.07161#bib.bib86)\], which are limited in their ability to simulate high\-fidelity, realistic failure scenarios due to the lack of system support for orchestrating distributed events, low\-level fault simulation and injection mechanisms, noise simulation, etc\. Existing live benchmarks also suffer from engineering practices that misuse chaos engineering tools, lack protection against reward hacking, etc\. \(see Appendix [B](https://arxiv.org/html/2605.07161#A2)\)\.

SREGym focuses on failure analysis and mitigation and is thus complementary to other related benchmarks for deployment and regression testing \[[77](https://arxiv.org/html/2605.07161#bib.bib120)\] and terminal\-based environment setup \[[53](https://arxiv.org/html/2605.07161#bib.bib102)\]\.

## 5 Discussion and Conclusion

The aspiration of SREGym is to push the standard of AI for SRE and to unlock a whole new evolution of agentic SRE technologies\. With this goal in mind, we develop SREGym as a high\-fidelity benchmark that represents existing challenges in real\-world production environments and as a usable, extensible framework that can be continuously evolved with new challenges for AI\. Several components of SREGym can be further enhanced and enriched, including more diverse noise modeling, more comprehensive fault simulation and failure modes, and system environments \(e\.g\., to include edge presences\); some of this work has already started\. A clear next step is to upgrade SREGym into a reinforcement\-learning \(RL\) style training ground for SRE agents beyond its current problem set\.

## Acknowledgement

We thank everyone who has supported, helped, and contributed to SREGym\. Specifically, we thank all the contributors to SREGym and all the users of SREGym who give us encouragement and feedback\. We thank Braden Hancock, Andy Konwinski, Brighten Godfrey, Sasa Misailovic, Xuan Feng, Lidong Zhou, and Darko Marinov for valuable feedback on the project\. SREGym is supported in part by a Slingshot grant from the Laude Institute and by NSF CNS\-2145295\.

## References

- \[1\] \(2025\) 2025 Stack Overflow Developer Survey\. [https://survey\.stackoverflow\.co/2025/](https://survey.stackoverflow.co/2025/)\.
- \[2\] \(2026\) 43% of AI\-generated code changes need debugging in production, survey finds\. [https://venturebeat\.com/technology/43\-of\-ai\-generated\-code\-changes\-need\-debugging\-in\-production\-survey\-finds](https://venturebeat.com/technology/43-of-ai-generated-code-changes-need-debugging-in-production-survey-finds)\.
- \[3\] A\. Alquraan, H\. Takruri, M\. Alfatafta, and S\. Al\-Kiswany \(2018\-10\) An Analysis of Network\-Partitioning Failures in Cloud Systems\. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’18\)\.
- \[4\] \(2026\-03\) Amazon tightens code controls after outages, including one caused by AI\. [https://www\.businessinsider\.com/amazon\-tightens\-code\-controls\-after\-outages\-including\-one\-ai\-2026\-3](https://www.businessinsider.com/amazon-tightens-code-controls-after-outages-including-one-ai-2026-3)\.
- \[5\] A\. Avizienis, J\. Laprie, B\. Randell, and C\. Landwehr \(2004\-01\) Basic Concepts and Taxonomy of Dependable and Secure Computing\. IEEE Transactions on Dependable and Secure Computing \(TDSC\) 1\(1\), pp\. 1–23\.
- \[6\] \(2025\) AWS DevOps Agent\. [https://aws\.amazon\.com/devops\-agent/](https://aws.amazon.com/devops-agent/)\.
- \[7\] \(2025\) Azure SRE Agent\. [https://azure\.microsoft\.com/en\-us/products/sre\-agent](https://azure.microsoft.com/en-us/products/sre-agent)\.
- \[8\] L\. N\. Bairavasundaram, G\. R\. Goodson, S\. Pasupathy, and J\. Schindler \(2007\-06\) An Analysis of Latent Sector Errors in Disk Drives\. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems \(SIGMETRICS’07\)\.
- \[9\] P\. Barham, A\. Donnelly, R\. Isaacs, and R\. Mortier \(2004\-12\) Using Magpie for Request Extraction and Workload Modelling\. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’04\)\.
- \[10\] N\. Bronson, A\. Aghayev, A\. Charapko, and T\. Zhu \(2021\-06\) Metastable Failures in Distributed Systems\. In Proceedings of the 18th Workshop on Hot Topics in Operating Systems \(HotOS’21\)\.
- \[11\] Chaos mesh: a powerful chaos engineering platform for kubernetes\. [https://chaos\-mesh\.org](https://chaos-mesh.org/)\.
- \[12\] \(2019\) Chaosblade: An Easy to Use and Powerful Chaos Engineering Toolkit\. [https://github\.com/chaosblade\-io/chaosblade](https://github.com/chaosblade-io/chaosblade)\.
- \[13\] Y\. Chen, J\. Pan, J\. Clark, Y\. Su, N\. Zheutlin, B\. Bhavya, R\. Arora, Y\. Deng, S\. Jha, and T\. Xu \(2025\-12\) Stratus: A Multi\-agent System for Autonomous Reliability Engineering of Modern Clouds\. In Proceedings of The 39th Annual Conference on Neural Information Processing Systems \(NeurIPS’25\)\.
- \[14\] Y\. Chen, M\. Shetty, G\. Somashekar, M\. Ma, Y\. Simmhan, J\. Mace, C\. Bansal, R\. Wang, and S\. Rajmohan \(2025\-05\) AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds\. In Proceedings of the 8th Conference on Machine Learning and Systems \(MLSys’25\)\.
- \[15\] Y\. Chen, H\. Xie, M\. Ma, Y\. Kang, X\. Gao, L\. Shi, Y\. Cao, X\. Gao, H\. Fan, M\. Wen, J\. Zeng, S\. Ghosh, X\. Zhang, C\. Zhang, Q\. Lin, S\. Rajmohan, D\. Zhang, and T\. Xu \(2024\-04\) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents\. In Proceedings of the 19th European Conference on Computer Systems \(EuroSys’24\)\.
- \[16\] A\. Chou, J\. Yang, B\. Chelf, S\. Hallem, and D\. Engler \(2001\-10\) An Empirical Study of Operating Systems Errors\. In Proceedings of the 18th ACM Symposium on Operating Systems Principles \(SOSP’01\)\.
- \[17\] M\. Chow, Y\. Wang, W\. Wang, A\. Hailu, R\. Bopardikar, B\. Zhang, J\. Qu, D\. Meisner, S\. Sonawane, Y\. Zhang, R\. Paim, M\. Ward, I\. Huang, M\. McNally, D\. Hodges, Z\. Farkas, C\. Gocmen, E\. Huang, and C\. Tang \(2024\-07\) ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre\-Production Testing\. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’24\)\.
- \[18\] S\. Y\. Chu, J\. W\. Kim, and M\. Y\. Yi \(2025\-04\) Think Together and Work Better: Combining Humans’ and LLMs’ Think\-Aloud Outcomes for Effective Text Evaluation\. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems \(CHI’25\)\.
- \[19\] \(2025\) Ciroos \- Reduce toil, investigate incidents faster, and drive autonomous operations\. [https://ciroos\.ai/](https://ciroos.ai/)\.
- \[20\] Claude Code by Anthropic\. [https://code\.claude\.com](https://code.claude.com/)\.
- \[21\] Codex: cloud coding agent\. [https://chatgpt\.com/codex](https://chatgpt.com/codex)\.
- \[22\] ConfigMaps\. [https://kubernetes\.io/docs/concepts/configuration/configmap](https://kubernetes.io/docs/concepts/configuration/configmap)\.
- \[23\] \(2026\) Demystifying Evals for AI Agents\. [https://www\.anthropic\.com/engineering/demystifying\-evals\-for\-ai\-agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)\.
- \[24\] Deployment\. [https://kubernetes\.io/docs/concepts/workloads/controllers/deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment)\.
- \[25\] \(2025\) dm\-dust \- A Linux kernel module which can be used to simulate the bad blocks behavior on a physical disk\. [https://docs\.kernel\.org/admin\-guide/device\-mapper/dm\-dust\.html](https://docs.kernel.org/admin-guide/device-mapper/dm-dust.html)\.
- \[26\] D\. Ford, F\. Labelle, F\. I\. Popovici, M\. Stokely, V\. Truong, L\. Barroso, C\. Grimes, and S\. Quinlan \(2010\-10\) Availability in Globally Distributed Storage Systems\. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’10\)\.
- \[27\] Y\. Gan, Y\. Zhang, D\. Cheng, A\. Shetty, P\. Rathi, N\. Katarki, A\. Bruno, J\. Hu, B\. Ritchken, B\. Jackson, K\. Hu, M\. Pancholi, Y\. He, B\. Clancy, C\. Colen, F\. Wen, C\. Leung, S\. Wang, L\. Zaruvinsky, M\. Espinosa, R\. Lin, Z\. Liu, J\. Padilla, and C\. Delimitrou \(2019\-04\) An Open\-Source Benchmark Suite for Microservices and Their Hardware\-Software Implications for Cloud & Edge Systems\. In Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS’19\)\.
- \[28\] J\. T\. Gu, X\. Sun, W\. Zhang, Y\. Jiang, C\. Wang, M\. Vaziri, O\. Legunsen, and T\. Xu \(2023\-10\) Acto: Automatic End\-to\-End Testing for Operation Correctness of Cloud System Management\. In Proceedings of the 29th ACM Symposium on Operating Systems Principles \(SOSP’23\)\.
- \[29\] J\. T\. Gu, Z\. Tang, Y\. Su, B\. A\. Stoica, X\. Sun, W\. X\. Zheng, Y\. Zhang, A\. Rahman, C\. Wang, and T\. Xu \(2026\-05\) Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management\. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation \(NSDI’26\)\.
- \[30\] H\. S\. Gunawi, M\. Hao, R\. O\. Suminto, A\. Laksono, A\. D\. Satria, J\. Adityatama, and K\. J\. Eliazar \(2016\-10\) Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages\. In Proceedings of the 7th ACM Symposium on Cloud Computing \(SoCC’16\)\.
- \[31\] H\. S\. Gunawi, C\. Rubio\-González, A\. C\. Arpaci\-Dusseau, R\. H\. Arpaci\-Dusseau, and B\. Liblit \(2008\-02\) EIO: Error Handling is Occasionally Correct\. In Proceedings of the 6th USENIX Conference on File and Storage Technologies \(FAST’08\)\.
- \[32\] H\. S\. Gunawi, R\. O\. Suminto, R\. Sears, C\. Golliher, S\. Sundararaman, X\. Lin, T\. Emami, W\. Sheng, N\. Bidokhti, C\. McCaffrey, G\. Grider, P\. M\. Fields, K\. Harms, R\. B\. Ross, A\. Jacobson, R\. Ricci, K\. Webb, P\. Alvaro, H\. B\. Runesha, M\. Hao, and H\. Li \(2018\-02\) Fail\-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems\. In Proceedings of the 16th USENIX Conference on File and Storage Technologies \(FAST’18\)\.
- \[33\] S\. Han, X\. Hu, H\. Huang, M\. Jiang, and Y\. Zhao \(2022\-11\) ADBench: Anomaly Detection Benchmark\. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems \(NeurIPS’22\)\.
- \[34\] Helm \- The package manager for Kubernetes\. [https://helm\.sh](https://helm.sh/)\.
- \[35\] P\. H\. Hochschild, P\. Turner, J\. C\. Mogul, R\. Govindaraju, P\. Ranganathan, D\. E\. Culler, and A\. Vahdat \(2021\-06\) Cores that don’t count\. In Proceedings of the ACM SIGOPS 21st Workshop on Hot Topics in Operating Systems \(HotOS’21\)\.
- \[36\] L\. Huang, M\. Magnusson, A\. B\. Muralikrishna, S\. Estyak, R\. Isaacs, A\. Aghayev, T\. Zhu, and A\. Charapko \(2022\-07\) Metastable Failures in the Wild\. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’22\)\.
- \[37\] P\. Huang, C\. Guo, L\. Zhou, J\. R\. Lorch, Y\. Dang, M\. Chintalapati, and R\. Yao \(2017\-05\) Gray Failure: The Achilles’ Heel of Cloud\-Scale Systems\. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems \(HotOS’17\), pp\. 150–155\.
- \[38\] R\. Isaacs, P\. Alvaro, R\. Majumdar, K\. Muniswamy\-Reddy, M\. Salamati, and S\. Soudjani \(2025\-05\) Analyzing Metastable Failures\. In Proceedings of the ACM SIGOPS 20th Workshop on Hot Topics in Operating Systems \(HotOS’25\)\.
- \[39\] V\. Jacob, F\. Song, A\. Stiegler, B\. Rad, Y\. Diao, and N\. Tatbul \(2021\-07\) Exathlon: a Benchmark for Explainable Anomaly Detection Over Time Series\. Proceedings of the VLDB Endowment \(VLDB’21\) 14\(11\), pp\. 2613–2626\.
- \[40\] Jaeger: open source, distributed tracing platform\. [https://www\.jaegertracing\.io](https://www.jaegertracing.io/)\.
- \[41\] S\. Jha, R\. R\. Arora, Y\. Watanabe, T\. Yanagawa, Y\. Chen, J\. Clark, B\. Bhavya, M\. Verma, H\. Kumar, H\. Kitahara, N\. Zheutlin, S\. Takano, D\. Pathak, F\. George, X\. Wu, B\. O\. Turkkan, G\. Vanloo, M\. Nidd, T\. Dai, O\. Chatterjee, P\. Gupta, S\. Samanta, P\. Aggarwal, R\. Lee, J\. Ahn, D\. Kar, A\. Paradkar, Y\. Deng, P\. Moogi, P\. Mohapatra, N\. Abe, C\. Narayanaswami, T\. Xu, L\. R\. Varshney, R\. Mahindru, A\. Sailer, L\. Shwartz, D\. Sow, N\. C\. M\. Fuller, and R\. Puri \(2025\-05\) ITBench: Evaluating AI Agents across Diverse Real\-World IT Automation Tasks\. In Proceedings of the 42nd International Conference on Machine Learning \(ICML’25\)\.
- \[42\] S\. Jha, S\. Cui, S\. Banerjee, T\. Xu, J\. Enos, M\. Showerman, Z\. T\. Kalbarczyk, and R\. K\. Iyer \(2020\-11\) Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems\. In Proceedings of the International Conference for High\-Performance Computing, Networking, Storage and Analysis \(SC’20\)\.
- \[43\] J\. Kaldor, J\. Mace, M\. Bejda, E\. Gao, W\. Kuropatwa, J\. O’Neill, K\. W\. Ong, B\. Schaller, P\. Shan, B\. Viscomi, V\. Venkataraman, K\. Veeraraghavan, and Y\. J\. Song \(2017\-10\) Canopy: An End\-to\-End Performance Tracing and Analysis System\. In Proceedings of the 26th Symposium on Operating Systems Principles \(SOSP’17\)\.
- \[44\] J\. Laprie \(1995\-06\) Dependable Computing: Concepts, Limits, Challenges\. In Proceedings of the 25th IEEE International Symposium on Fault\-Tolerant Computing \(FTCS’95\)\.
- \[45\] Y\. Lee, J\. Kim, J\. Kim, H\. Cho, J\. Kang, P\. Kang, and N\. Kim \(2025\-11\) CheckEval: A Reliable LLM\-as\-a\-Judge Framework for Evaluating Text Generation Using Checklists\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP’25\)\.
- \[46\] H\. Liu, S\. Lu, M\. Musuvathi, and S\. Nath \(2019\-05\) What Bugs Cause Production Cloud Incidents?\. In Proceedings of the 17th Workshop on Hot Topics in Operating Systems \(HotOS’19\)\.
- \[47\] Y\. Liu, C\. Pei, L\. Xu, B\. Chen, M\. Sun, Z\. Zhang, Y\. Sun, S\. Zhang, K\. Wang, H\. Zhang, J\. Li, G\. Xie, X\. Wen, X\. Nie, M\. Ma, and D\. Pei \(2025\-06\) OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models\. arXiv:2310\.07637\.
- \[48\]D\. Lo, L\. Cheng, R\. Govindaraju, P\. Ranganathan, and C\. Kozyrakis\(2015\-06\)Heracles: Improving Resource Efficiency at Scale\.InProceedings of the 42nd Annual International Symposium on Computer Architecture \(ISCA’15\),Cited by:[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.14.2)\.
- \[49\]D\. Loker\(2026\)AI vs Human Code Gen Report: AI Code Creates 1\.7x More Issues\.Note:[https://www\.coderabbit\.ai/blog/state\-of\-ai\-vs\-human\-code\-generation\-report](https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report)Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p1.1)\.
- \[50\]Loki \- a horizontally\-scalable, highly\-available, multi\-tenant log aggregation system inspired by Prometheus\.Note:[https://github\.com/grafana/loki](https://github.com/grafana/loki)Cited by:[2nd item](https://arxiv.org/html/2605.07161#S2.I4.i2.p1.1)\.
- \[51\]Y\. Luo, J\. Jiang, J\. Feng, L\. Tao, Q\. Zhang, X\. Wen, Y\. Sun, S\. Zhang, and D\. Pei\(2025\-11\)From Observability Data to Diagnosis: An Evolving Multi\-agent System for Incident Management in Cloud Systems\.arXiv:2510\.24145\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[52\]J\. Mao, L\. Li, Y\. Gao, Z\. Peng, S\. He, C\. Zhang, S\. Qin, S\. Khalid, Q\. Lin, S\. Rajmohan,et al\.\(2026\-04\)StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis\.arXiv:2510\.10074\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[53\]M\. A\. Merrill, A\. G\. Shaw, N\. Carlini, B\. Li, H\. Raj, I\. Bercovich, L\. Shi, J\. Y\. Shin, T\. Walshe, E\. K\. Buchanan, J\. Shen, G\. Ye, H\. Lin, J\. Poulos, M\. Wang, M\. Nezhurina, J\. Jitsev, D\. Lu, O\. M\. Mastromichalakis, Z\. Xu, Z\. Chen, Y\. Liu, R\. Zhang, L\. L\. Chen, A\. Kashyap, J\. Uslu, J\. Li, J\. Wu, M\. Yan, S\. Bian, V\. Sharma, K\. Sun, S\. Dillmann, A\. Anand, A\. Lanpouthakoun, B\. Koopah, C\. Hu, E\. Guha, G\. H\. S\. Dreiman, J\. Zhu, K\. Krauth, L\. Zhong, N\. Muennighoff, R\. Amanfu, S\. Tan, S\. Pimpalgaonkar, T\. Aggarwal, X\. Lin, X\. Lan, X\. Zhao, Y\. Liang, Y\. Wang, Z\. Wang, C\. Zhou, D\. Heineman, H\. Liu, H\. Trivedi, J\. Yang, J\. Lin, M\. Shetty, M\. Yang, N\. Omi, N\. Raoof, S\. Li, T\. Y\. Zhuo, W\. Lin, Y\. Dai, Y\. Wang, W\. Chai, S\. Zhou, D\. Wahdany, Z\. She, J\. Hu, Z\. Dong, Y\. Zhu, S\. Cui, A\. Saiyed, A\. Kolbeinsson, J\. Hu, C\. M\. Rytting, R\. Marten, Y\. Wang, A\. Dimakis, A\. Konwinski, and L\. Schmidt\(2026\-01\)Terminal\-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces\.arXiv:2601\.11868\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p4.1)\.
- \[54\]J\. J\. Meza, T\. Gowda, A\. Eid, T\. Ijaware, D\. Chernyshev, Y\. Yu, M\. N\. Uddin, R\. Das, C\. Nachiappan, S\. Tran, S\. Shi, T\. Luo, D\. K\. Hong, S\. Panneerselvam, H\. Ragas, S\. Manavski, W\. Wang, and F\. Richard\(2023\-07\)Defcon: Preventing Overload with Graceful Feature Degradation\.InProceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’23\),Cited by:[Appendix H](https://arxiv.org/html/2605.07161#A8.p3.1),[Appendix H](https://arxiv.org/html/2605.07161#A8.p4.1),[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.11.2)\.
- \[55\]Open source metrics and monitoring for your systems and services\.Note:[https://prometheus\.io](https://prometheus.io/)Cited by:[1st item](https://arxiv.org/html/2605.07161#S2.I4.i1.p1.1)\.
- \[56\]\(2024\)Otel\-Demo \- A microservice\-based distributed system intended to illustrate the implementation of OpenTelemetry in a near real\-world environment\.Note:[https://github\.com/open\-telemetry/opentelemetry\-demo](https://github.com/open-telemetry/opentelemetry-demo)Cited by:[§2\.2](https://arxiv.org/html/2605.07161#S2.SS2.p1.1)\.
- \[57\]N\. Palix, G\. Thomas, S\. Saha, C\. Calvès, J\. Lawall, and G\. Muller\(2011\-03\)Faults in Linux: Ten Years Later\.InProceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS’11\),Cited by:[item 1](https://arxiv.org/html/2605.07161#S1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.07161#S1.p3.1),[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.4.2)\.
- \[58\]T\. Patel and D\. Tiwari\(2020\-02\)CLITE: Efficient and QoS\-Aware Co\-Location of Multiple Latency\-Critical Jobs for Warehouse Scale Computers\.InProceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture \(HPCA’20\),Cited by:[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.14.2)\.
- \[59\]C\. Pei, Z\. Wang, F\. Liu, Z\. Li, Y\. Liu, X\. He, R\. Kang, T\. Zhang, J\. Chen, J\. Li, G\. Xie, and D\. Pei\(2025\-05\)Flow\-of\-Action: SOP Enhanced LLM\-Based Multi\-Agent System for Root Cause Analysis\.InCompanion Proceedings of the ACM on Web Conference 2025 \(WWW’25\),Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[60\]T\. Pelkonen, S\. Franklin, J\. Teller, P\. Cavallaro, Q\. Huang, J\. Meza, and K\. Veeraraghavan\(2015\-08\)Gorilla: a Fast, Scalable, In\-memory Time Series Database\.Proceedings of the VLDB Endowment \(VLDB’15\)8\(12\),pp\. 1816–1827\.Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p2.1)\.
- \[61\]Pods\.Note:[https://kubernetes\.io/docs/concepts/workloads/pods](https://kubernetes.io/docs/concepts/workloads/pods)Cited by:[§3\.4](https://arxiv.org/html/2605.07161#S3.SS4.p1.1)\.
- \[62\]\(2024\-05\)Quantifying GitHub Copilot’s Impact in the Enterprise with Accenture\.Note:[https://github\.blog/news\-insights/research/research\-quantifying\-github\-copilots\-impact\-in\-the\-enterprise\-with\-accenture/](https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/)Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p1.1)\.
- \[63\]\(2025\)Resolve Satellite\.Note:[https://docs\.resolve\.ai/resolve\-satellite](https://docs.resolve.ai/resolve-satellite)Cited by:[§2\.2](https://arxiv.org/html/2605.07161#S2.SS2.p1.1)\.
- \[64\]\(2026\)Resolve\.ai \| AI for prod\.Note:[https://resolve\.ai](https://resolve.ai/)Cited by:[Appendix G](https://arxiv.org/html/2605.07161#A7.p2.1),[§2\.2](https://arxiv.org/html/2605.07161#S2.SS2.p1.1),[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[65\]B\. Schroeder, S\. Damouras, and P\. Gill\(2010\-02\)Understanding Latent Sector Errors and How to Protect against Them\.InProceedings of the 8th USENIX Conference on File and Storage Technologies \(FAST’10\),Cited by:[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.5.2)\.
- \[66\]B\. H\. Sigelman, L\. A\. Barroso, M\. Burrows, P\. Stephenson, M\. Plakal, D\. Beaver, S\. Jaspan, and C\. Shanbhag\(2010\-04\)Dapper, a Large\-Scale Distributed Systems Tracing Infrastructure\.Technical Report dapper\-2010\-1, Google, Inc\.Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p2.1)\.
- \[67\]\(2026\)SRE\-skills\-bench: LLM Benchmark for SRE Tasks\.Note:[https://github\.com/Rootly\-AI\-Labs/SRE\-skills\-bench](https://github.com/Rootly-AI-Labs/SRE-skills-bench)Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p3.1),[§4](https://arxiv.org/html/2605.07161#S4.p2.1)\.
- \[68\]\(2013\)stress\-ng \- a tool to load and stress a computer system\.Note:[https://wiki\.ubuntu\.com/Kernel/Reference/stress\-ng](https://wiki.ubuntu.com/Kernel/Reference/stress-ng)Cited by:[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.3.1)\.
- \[69\]X\. Sun, R\. Cheng, J\. Chen, E\. Ang, O\. Legunsen, and T\. Xu\(2020\-11\)Testing Configuration Changes in Context to Prevent Production Failures\.InProceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’20\),Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p2.1)\.
- \[70\]X\. Sun, W\. Luo, J\. T\. Gu, A\. Ganesan, R\. Alagappan, M\. Gasch, L\. Suresh, and T\. Xu\(2022\-07\)Automatic Reliability Testing for Cluster Management Controllers\.InProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’22\),Cited by:[item 1](https://arxiv.org/html/2605.07161#S1.I1.i1.p1.1),[§2\.4](https://arxiv.org/html/2605.07161#S2.SS4.p2.1)\.
- \[71\]P\. Tang, S\. Tang, H\. Pu, Z\. Miao, and Z\. Wang\(2025\-09\)MicroRCA\-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents\.arXiv:2509\.15635\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[72\]Y\. Tian, Y\. Liu, Z\. Chong, Z\. Huang, and H\. Jacobsen\(2025\-08\)GALA: Can Graph\-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?\.arXiv:2508\.12472\.Cited by:[§2\.5](https://arxiv.org/html/2605.07161#S2.SS5.p2.16)\.
- \[73\]TierZero \- Agents that handle the incidents, alerts, and internal questions that fragment your team’s day\.Note:[https://www\.tierzero\.ai/](https://www.tierzero.ai/)Cited by:[Appendix G](https://arxiv.org/html/2605.07161#A7.p2.1),[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[74\]B\. Treynor, M\. Dahlin, V\. Rau, and B\. Beyer\(2017\-08\)The Calculus of Service Availability\.Communications of the ACM \(CACM\)60\(9\),pp\. 42–47\.Cited by:[§2\.4](https://arxiv.org/html/2605.07161#S2.SS4.p2.1)\.
- \[75\]K\. Veeraraghavan, J\. Meza, S\. Michelson, S\. Panneerselvam, A\. Gyori, D\. Chou, S\. Margulis, D\. Obenshain, S\. Padmanabha, A\. Shah, Y\. J\. Song, and T\. Xu\(2018\-10\)Maelstrom: Mitigating Datacenter\-level Disasters by Draining Interdependent Traffic Safely and Efficiently\.InProceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’18\),Cited by:[§2\.4](https://arxiv.org/html/2605.07161#S2.SS4.p2.1)\.
- \[76\]H\. Wang, Q\. Mang, A\. Cheung, K\. Sen, and D\. Song\(2026\)How We Broke Top AI Agent Benchmarks: And What Comes Next\.Note:[https://rdi\.berkeley\.edu/blog/trustworthy\-benchmarks\-cont/](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/)Cited by:[Appendix B](https://arxiv.org/html/2605.07161#A2.p6.1)\.
- \[77\]Y\. Wang, G\. Yu, H\. Huang, Z\. Wang, Y\. Huang, P\. Chen, and M\. R\. Lyu\(2026\-02\)Cloud\-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems\.arXiv:2603\.00468\.Cited by:[Appendix B](https://arxiv.org/html/2605.07161#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.07161#A2.p2.1),[Appendix B](https://arxiv.org/html/2605.07161#A2.p5.1),[§2\.5](https://arxiv.org/html/2605.07161#S2.SS5.p1.1),[§4](https://arxiv.org/html/2605.07161#S4.p2.1),[§4](https://arxiv.org/html/2605.07161#S4.p4.1)\.
- \[78\]Z\. Wang, Z\. Liu, Y\. Zhang, A\. Zhong, J\. Wang, F\. Yin, L\. Fan, L\. Wu, and Q\. Wen\(2024\-10\)RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool\-Augmented Large Language Models\.InProceedings of the 33rd ACM International Conference on Information and Knowledge Management \(CIKM’24\),Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[79\]J\. Xu, Q\. Zhang, Z\. Zhong, S\. He, C\. Zhang, Q\. Lin, D\. Pei, P\. He, D\. Zhang, and Q\. Zhang\(2025\-01\)OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?\.InProceedings of the 13th International Conference on Learning Representations \(ICLR’25\),Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p3.1),[§4](https://arxiv.org/html/2605.07161#S4.p2.1)\.
- \[80\]T\. Xu, X\. Jin, P\. Huang, Y\. Zhou, S\. Lu, L\. Jin, and S\. Pasupathy\(2016\-11\)Early Detection of Configuration Errors to Reduce Failure Damage\.InProceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation \(OSDI’16\),Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p2.1)\.
- \[81\]T\. Xu, J\. Zhang, P\. Huang, J\. Zheng, T\. Sheng, D\. Yuan, Y\. Zhou, and S\. Pasupathy\(2013\-11\)Do Not Blame Users for Misconfigurations\.InProceedings of the 24th ACM Symposium on Operating Systems Principles \(SOSP’13\),Cited by:[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.7.2)\.
- \[82\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2023\-05\)ReAct: Synergizing Reasoning and Acting in Language Models\.InProceedings of the 11th International Conference on Learning Representations \(ICLR ’23\),Cited by:[Appendix B](https://arxiv.org/html/2605.07161#A2.p3.1)\.
- \[83\]D\. Yuan, H\. Mai, W\. Xiong, L\. Tan, Y\. Zhou, and S\. Pasupathy\(2010\-03\)SherLog: Error Diagnosis by Connecting Clues from Run\-Time Logs\.InProceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS’10\),Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p2.1)\.
- \[84\]E\. Zhai, A\. Chen, R\. Piskac, M\. Balakrishnan, B\. Tian, B\. Song, and H\. Zhang\(2020\-02\)Check before You Change: Preventing Correlated Failures in Service Updates\.InProceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation \(NSDI’20\),Cited by:[item 3](https://arxiv.org/html/2605.07161#S1.I1.i3.p1.1),[3rd item](https://arxiv.org/html/2605.07161#S2.I5.i3.p1.1)\.
- \[85\]L\. Zhang, T\. Jia, K\. Wang, W\. Hong, C\. Duan, M\. He, and Y\. Li\(2025\-08\)Adaptive Root Cause Localization for Microservice Systems with Multi\-Agent Recursion\-of\-Thought\.arXiv:2508\.20370\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[86\]L\. Zhang, Y\. Zhai, T\. Jia, C\. Duan, M\. He, L\. Pan, Z\. Liu, B\. Ding, and Y\. Li\(2025\-11\)MicroRemed: Benchmarking LLMs in Microservices Remediation\.arXiv:2511\.01166\.Cited by:[Appendix B](https://arxiv.org/html/2605.07161#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.07161#A2.p5.1),[§4](https://arxiv.org/html/2605.07161#S4.p3.1)\.
- \[87\]W\. Zhang, H\. Guo, J\. Yang, Z\. Tian, Y\. Zhang, Y\. Chaoran, Z\. Li, T\. Li, X\. Shi, L\. Zheng, and B\. Zhang\(2024\-11\)mABC: Multi\-Agent Blockchain\-inspired Collaboration for Root Cause Analysis in Micro\-Services Architecture\.InFindings of the Association for Computational Linguistics \(EMNLP’24\),Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[88\]X\. Zhang, Q\. Wang, M\. Li, Y\. Yuan, M\. Xiao, F\. Zhuang, and D\. Yu\(2025\-11\)TAMO: Fine\-Grained Root Cause Analysis via Tool\-Assisted LLM Agent With Multi\-Modality Observation Data in Cloud\-Native Systems\.IEEE Transactions on Services Computing \(TSC\)18\(6\),pp\. 4221–4233\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[89\]Y\. Zhang, K\. Rodrigues, Y\. Luo, M\. Stumm, and D\. Yuan\(2019\-10\)The Inflection Point Hypothesis: A Principled Debugging Approach for Locating the Root Cause of a Failure\.InProceedings of the 27th ACM Symposium on Operating Systems Principles \(SOSP’19\),Cited by:[§1](https://arxiv.org/html/2605.07161#S1.p2.1)\.
- \[90\]Y\. Zhang, U\. Paul, M\. d’Amorim, and A\. Rahman\(2025\-12\)Configuration Defects in Kubernetes\.arXiv:2512\.05062\.Cited by:[Table 1](https://arxiv.org/html/2605.07161#S2.T1.5.8.2)\.
- \[91\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\-12\)Judging LLM\-as\-a\-Judge with MT\-Bench and Chatbot Arena\.InProceedings of the 37th Annual Conference on Neural Information Processing Systems \(NeurIPS’23\),Cited by:[§2\.5](https://arxiv.org/html/2605.07161#S2.SS5.p2.16),[§2\.5](https://arxiv.org/html/2605.07161#S2.SS5.p4.5)\.
- \[92\]Z\. Zhong, R\. Fu, M\. Ma, S\. Zhang, Y\. Sun, C\. Bansal, and D\. Pei\(2026\-03\)LLM\-Enhanced Failure Localization in Microservices: Integrating Multi\-Modal Data and Expert Interpretation\.IEEE Transactions on Services Computing \(TSC\)\(\),pp\. 1–14\.Cited by:[§4](https://arxiv.org/html/2605.07161#S4.p1.1)\.
- \[93\]X\. Zhou, X\. Peng, T\. Xie, J\. Sun, C\. Xu, C\. Ji, and W\. Zhao\(2018\-05\)Benchmarking Microservice Systems for Software Engineering Research\.InProceedings of the 40th International Conference on Software Engineering: Companion Proceeedings \(ICSE ’18\),Cited by:[Appendix H](https://arxiv.org/html/2605.07161#A8.p4.1),[§2\.2](https://arxiv.org/html/2605.07161#S2.SS2.p1.1)\.
- \[94\]Y\. Zhu, T\. Jin, Y\. Pruksachatkun, A\. Zhang, S\. Liu, S\. Cui, S\. Kapoor, S\. Longpre, K\. Meng, R\. Weiss, F\. Barez, R\. Gupta, J\. Dhamala, J\. Merizian, M\. Giulianelli, H\. Coppock, C\. Ududec, J\. Sekhon, J\. Steinhardt, A\. Kellermann, S\. Schwettmann, M\. Zaharia, I\. Stoica, P\. Liang, and D\. Kang\(2025\-07\)Establishing Best Practices for Building Rigorous Agentic Benchmarks\.arXiv:2507\.02825\.Cited by:[2nd item](https://arxiv.org/html/2605.07161#S2.I3.i2.p1.2)\.

## Appendix A Diagnosis Evaluation Checklist

Table [6](https://arxiv.org/html/2605.07161#A1.T6) lists the full checklist used by the diagnosis oracle \(§[2\.5](https://arxiv.org/html/2605.07161#S2.SS5)\)\. Each question requires a Yes/No answer; for each question, the LLM evaluator also returns supporting evidence and a confidence level \(High/Medium/Low\)\. Dimension weights and the pass threshold are set in a YAML file \(current default: $w=\frac{1}{3}$ per dimension\)\.

Table 6: Full diagnosis evaluation checklist\. Each question is answered Yes/No by the LLM evaluator\. The per\-dimension score formula and the aggregated score formula can be found in §[2\.5](https://arxiv.org/html/2605.07161#S2.SS5)\.

| Dimension | ID | Question | Evaluator Hint |
| --- | --- | --- | --- |
| Fault Localization \($w=\frac{1}{3}$\) | D1\-Q1 | Does the diagnosis name the same service, deployment, pod, node, or infrastructure component that the ground\-truth identifies as the fault origin? | Compare against the target component and target resource type in the fault specification \(spec\)\. |
| | D1\-Q2 | Does the diagnosis correctly distinguish the fault origin from any secondary or cascading failure points mentioned in the ground\-truth? | Check that the diagnosis points to the root\-cause component, not a downstream victim\. |
| | D1\-Q3 | Does the diagnosis avoid misidentifying a healthy component as the fault origin? | Verify the diagnosed component matches the fault spec’s target component\. |
| Fault Characterization \($w=\frac{1}{3}$\) | D2\-Q1 | Does the diagnosis identify the same injected mechanism described in the ground\-truth \(e\.g\., wrong network port, missing environment variables, wrong container image, wrong selector, and memory limit\)? | Match against the fault mechanism and injector method in the structured spec\. |
| | D2\-Q2 | Does the diagnosis include concrete mutated details from the injection logic \(e\.g\., environment variable, configuration value, network port, selector, container image tag, and resource limit\)? | Compare concrete claims against parameters and the target mutation implied by the injector method\. |
| | D2\-Q3 | Does the diagnosis avoid attributing the fault to an incorrect or unrelated fault type? | Check that the diagnosis does not conflict with the problem class, injector method, or injected parameter values\. |
| Scope Precision \($w=\frac{1}{3}$\) | D3\-Q1 | Does the diagnosis avoid blaming components that are not identified in the ground\-truth as contributing to the fault? | Check for over\-attribution: the diagnosis should not blame uninvolved components\. |
| | D3\-Q2 | Does the diagnosis include all components listed in the ground\-truth as contributing to or affected by the fault? | Check for under\-attribution: all ground\-truth components should be pointed out\. |
| | D3\-Q3 | Does the diagnosis correctly describe the impact or symptom consistent with what the ground\-truth states? | Compare stated impact against mechanism, parameters, and target component in the fault spec\. |

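
To make the role of the dimension weights concrete, the following is a minimal sketch of how nine Yes/No answers could be folded into one score\. The actual per\-dimension and aggregated formulas are defined in §2\.5 and are not reproduced here, so the two formulas below \(fraction of Yes answers per dimension, then a weighted sum\) are assumptions, as are the function names and the threshold value used in the usage note\.

```python
from typing import Dict, List

# Equal dimension weights, matching the default w = 1/3 stated for Table 6.
WEIGHTS: Dict[str, float] = {"D1": 1 / 3, "D2": 1 / 3, "D3": 1 / 3}


def dimension_score(answers: List[bool]) -> float:
    """ASSUMPTION: a dimension's score is the fraction of its questions answered Yes."""
    return sum(answers) / len(answers)


def aggregate_score(
    answers: Dict[str, List[bool]], weights: Dict[str, float] = WEIGHTS
) -> float:
    """ASSUMPTION: the aggregated score is the weighted sum of the dimension scores."""
    return sum(weights[dim] * dimension_score(a) for dim, a in answers.items())


def passes(score: float, threshold: float) -> bool:
    """The pass threshold is configurable; its real default is not given in this appendix."""
    return score >= threshold


# Hypothetical evaluator output: two Yes answers in D1, none in D2 or D3.
example_answers = {
    "D1": [True, True, False],
    "D2": [False, False, False],
    "D3": [False, False, False],
}
example_score = aggregate_score(example_answers)
```

Under these assumptions, `passes(example_score, threshold=0.5)` is false for the example above; the value `0.5` is purely illustrative\.
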
## Appendix B Engineering Practices

While building SREGym, we made several engineering decisions that we believe are important for any benchmark aiming to evaluate autonomous SRE agents\. We document them as recommendations for future SRE benchmark developers, with the rationale behind each and, where useful, examples of departures we observed in existing benchmarks \[[14](https://arxiv.org/html/2605.07161#bib.bib27),[41](https://arxiv.org/html/2605.07161#bib.bib28),[86](https://arxiv.org/html/2605.07161#bib.bib86),[77](https://arxiv.org/html/2605.07161#bib.bib120)\]\.

Benefits of live environments\. A natural design is to capture environment telemetry at one moment and present it to the agents as a static artifact, motivated by a desire to remove nondeterminism from evaluation \[[77](https://arxiv.org/html/2605.07161#bib.bib120)\]\. However, static artifacts are fundamentally limited\. Noise and nondeterminism are intrinsic properties of the production environments that SRE agents must operate in, where signals are often partial, telemetry can race with failures, and the environment state evolves while the agents are taking actions\. Snapshots reduce the task to static log triage and eliminate the operational skills the benchmark claims to measure: forming hypotheses, issuing probes, and observing how the live environment responds\. A snapshot also collapses the interactive nature of troubleshooting\. Real incident response is iterative and time\-ordered: an engineer or agent issues a command, waits for it to take effect, watches the system’s state, and decides the next step accordingly\. SREGym therefore adopts a live system environment, so the agent is evaluated against noisy, evolving environments that mirror production systems\.

No restriction on the agent architecture\. The space of SRE agents includes multi\-agent systems with domain\-specific tools \(e\.g\., Stratus \[[13](https://arxiv.org/html/2605.07161#bib.bib54)\]\) as well as general\-purpose coding agents \(e\.g\., Claude Code and Codex\)\. AIOpsLab \[[14](https://arxiv.org/html/2605.07161#bib.bib27)\], for example, requires the evaluated agent to interact in a ReAct \[[82](https://arxiv.org/html/2605.07161#bib.bib122)\] loop mediated by an orchestrator that parses every action and exposes a fixed set of function signatures\. Agents not built in the ReAct architecture would need to be ported or integrated to be evaluated on AIOpsLab\. SREGym keeps its agent interface minimal: only the `submit()` call is required\. This decoupling lets us evaluate the same set of problems against architectures as different as Stratus, Claude Code, and Codex with the same benchmark framework\.

Programmable runtime for synchronizing and coordinating events\. The benchmark runtime must be expressive enough to schedule and synchronize events across injectors, oracles, and the agent’s actions\. Two classes of SREGym problems make this requirement concrete\. Metastable failures \(see §[D\.1](https://arxiv.org/html/2605.07161#A4.SS1)\) pair a self\-sustaining application trigger \(e\.g\., a misconfigured `GOGC` value or a retry\-storm configuration\) with an infrastructure constraint \(e\.g\., a tight namespace memory quota\); the two components must be injected with the right timing and ordering to drive the system into the metastable state, and the mitigation oracle must continuously observe the environment to distinguish a transient recovery from a relapse\. Noise simulation requires a separate concurrent loop that injects transient events on its own schedule while the target fault is active, so the agent sees a contemporaneous mix of distractor and target evidence\. Benchmarks built on substrates such as Ansible Playbooks, as in ITBench \[[41](https://arxiv.org/html/2605.07161#bib.bib28)\], cannot express these coordination patterns directly and tend to defer the missing functionality to ad hoc shell scripts, which makes timing\-sensitive faults and concurrent noise hard to control and reproduce\. SREGym runs as a Python service that owns fault scheduling, noise scheduling, and oracle probing in a single process, so problem developers can express timing\-sensitive behavior in the problem definition rather than in external, indirect scripts\.
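
The coordination pattern described above \(an ordered compound injection running alongside an independent noise loop in a single process\) can be sketched with `asyncio`\. This is an illustration only, not SREGym's runtime: every function name, the event log, and the delays are hypothetical, and the injectors merely record what they would do\. The ordering \(constraint first, trigger second\) follows the description in §D\.1\.

```python
import asyncio
from typing import List


async def inject_compound_fault(log: List[str], delay_s: float) -> None:
    """Hypothetical injector: apply the infrastructure constraint, then the trigger."""
    log.append("inject:constraint")   # e.g., a tight namespace memory quota
    await asyncio.sleep(delay_s)      # let the system settle into the vulnerable state
    log.append("inject:trigger")      # e.g., a misconfigured GOGC value


async def noise_loop(log: List[str], period_s: float, stop: asyncio.Event) -> None:
    """Hypothetical noise injector running on its own schedule until told to stop."""
    n = 0
    while not stop.is_set():
        log.append(f"noise:{n}")
        n += 1
        await asyncio.sleep(period_s)


async def run_problem(
    delay_s: float = 0.02, period_s: float = 0.005, hold_s: float = 0.02
) -> List[str]:
    log: List[str] = []
    stop = asyncio.Event()
    # The noise loop runs concurrently with the ordered fault injection.
    noise = asyncio.create_task(noise_loop(log, period_s, stop))
    await inject_compound_fault(log, delay_s)
    await asyncio.sleep(hold_s)       # the fault is active; an agent would act here
    stop.set()
    await noise
    return log


def run_demo() -> List[str]:
    return asyncio.run(run_problem())
```

Calling `run_demo()` returns a log in which the constraint precedes the trigger and noise events are interleaved with both, which is the "contemporaneous mix" the paragraph refers to\. Keeping both schedules in one event loop is what makes the relative timing controllable from the problem definition\.
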

Avoiding misuses of chaos\-engineering tools\. Chaos engineering tools such as Chaos Mesh \[[11](https://arxiv.org/html/2605.07161#bib.bib121)\] and Chaosblade \[[12](https://arxiv.org/html/2605.07161#bib.bib66)\] are valuable for their intended use cases: perturbing a running system to test application resiliency against unexpected failures\. By design, these tools inject *symptoms* \(killed pods, dropped packets, throttled CPUs, latency spikes\) rather than *defects*\. There are no underlying faults for an SRE agent to discover, and the only valid “mitigation” is to stop the chaos injectors, which is *not* an operational skill an agent should be rewarded for learning\. Several recent benchmarks conflate resiliency testing with operational evaluation: MicroRemed \[[86](https://arxiv.org/html/2605.07161#bib.bib86)\] sources its faults entirely from Chaos Mesh; Cloud\-OpsBench \[[77](https://arxiv.org/html/2605.07161#bib.bib120)\] draws on Chaosblade alongside its Kubernetes misconfigurations; ITBench \[[41](https://arxiv.org/html/2605.07161#bib.bib28)\], which otherwise injects faults via direct Kubernetes manipulation, wraps a Chaos Mesh schedule for roughly 16% of its SRE scenarios \(6 of 36\)\. For those scenarios, there is no defect for the agent to fix, and the only action that “mitigates” the problem is stopping the chaos\-engineering tools or deleting their schedule\. SREGym injects faults via direct Kubernetes manipulation, syscall\-level eBPF probes, and operator misoperation rather than symptomatic perturbations, so every SRE problem has underlying defects for the agent to diagnose and resolve\.

Protection against reward hacking\. AI agents can exploit benchmark infrastructure to inflate scores without solving the underlying tasks \[[76](https://arxiv.org/html/2605.07161#bib.bib5)\]\. In SRE benchmarks, the most direct manifestation is an agent that discovers and disables the fault\-injection services rather than reasoning about the actual faults\. Neither AIOpsLab \[[14](https://arxiv.org/html/2605.07161#bib.bib27)\] nor ITBench \[[41](https://arxiv.org/html/2605.07161#bib.bib28)\] properly protects against such reward hacking: their fault injectors run as identifiable pods in the same environment the agent inspects\. A second exploit is treating alert\-clearing as the success signal: the Stratus paper \[[13](https://arxiv.org/html/2605.07161#bib.bib54)\] reports that 8 of 18 ITBench mitigation problems \(44%\) can be “solved” by a generic pod\-restart loop, in which the fault injector loses track of the pod after it is restarted, the alert clears, and the agent is credited with a successful mitigation despite taking no action on the actual defects\. SREGym hides its fault\-injection plane behind a proxy that evaluated agents have no visibility into, and uses state\-based mitigation oracles that probe live environment health rather than relying on alert suppression\.
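
A state\-based oracle that also separates a transient recovery from a relapse can be sketched as repeated health probes followed by a check that the most recent probes are all healthy\. The sketch below is an assumption about one reasonable shape for such an oracle, not SREGym's implementation; `observe`, `mitigation_holds`, the probe callable, and the streak length are all hypothetical\.

```python
import time
from typing import Callable, List, Sequence


def observe(
    probe: Callable[[], bool],
    n_probes: int,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Probe live environment health n_probes times, interval_s apart."""
    samples: List[bool] = []
    for _ in range(n_probes):
        samples.append(probe())   # True means the environment is healthy right now
        sleep(interval_s)
    return samples


def mitigation_holds(samples: Sequence[bool], required_streak: int) -> bool:
    """Accept a mitigation only if the last `required_streak` probes were all healthy.

    A brief recovery followed by a relapse therefore fails, unlike a check
    that only looks at whether an alert cleared at some point.
    """
    if required_streak <= 0 or len(samples) < required_streak:
        return False
    return all(samples[-required_streak:])
```

For example, the sample sequence `[False, True, True, False, False, False]` is rejected with `required_streak=3`, while `[False, True, True, True]` is accepted\.
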

## Appendix C Combinatorial Coverage Details

We provide the detailed breakdown behind the combinatorial coverage reported in the preamble of §[2\.4](https://arxiv.org/html/2605.07161#S2.SS4)\. Table [7](https://arxiv.org/html/2605.07161#A3.T7) groups SREGym’s fault primitives by the classes of target components \(referred to as “target” for short\) they are compatible with, and reports the number of viable \(fault, target\) pairs per class across the 139\-pod injection\-target space \(plus three worker nodes for hardware\-level faults\)\.

Table 7: Fault\-target compatibility classes\. “\# Faults” refers to the number of fault primitives in each class; since one fault primitive can be compatible with multiple target classes \(e\.g\., a missing environment variable affects both MongoDB and non\-MongoDB pods\), the column does not sum to the primitive total\. “Pairs” is the count of viable \(fault, target\) combinations, totaling 3,623\.

| Class | \# Faults | Compatible targets | Pairs |
| --- | --- | --- | --- |
| Universal Kubernetes\-level | 25 | 139 pods | 3,475 |
| Storage\-dependent | 5 | 6 PVC\-mounted pods | 30 |
| DaemonSet\-level | 1 | 3 daemonsets | 3 |
| Operator\-level | 6 | 5 Kubernetes applications | 30 |
| MongoDB\-specific | 4 | 18 MongoDB pods | 72 |
| Valkey\-specific | 2 | 1 pod | 2 |
| App\-layer misconfiguration | 1 | 2 services | 2 |
| Node/kernel | 3 | 3 worker nodes | 9 |
| Total | | | 3,623 |

The 90 curated problems we evaluate against in §[3](https://arxiv.org/html/2605.07161#S3) exercise only 2\.5% of the 3,623 viable \(fault, target\) pairs\. The noise simulation and multi\-fault composition further multiply the combinations\. The same combinatorial structure positions SREGym as a natural environment for reinforcement\-learning rollouts, where each \(fault, target\) pair becomes an episode against a live, production\-like system environment \(§[5](https://arxiv.org/html/2605.07161#S5)\)\.
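
The arithmetic behind Table 7 can be checked with a few lines of Python\. The sketch assumes that each class's pair count is the product of its fault count and its target count, which holds for every row of the table, and that the 2\.5% figure is the ratio of the 90 curated problems to the total; the variable names are illustrative\.

```python
# (class name, number of fault primitives, number of compatible targets), from Table 7.
CLASSES = [
    ("Universal Kubernetes-level", 25, 139),
    ("Storage-dependent", 5, 6),
    ("DaemonSet-level", 1, 3),
    ("Operator-level", 6, 5),
    ("MongoDB-specific", 4, 18),
    ("Valkey-specific", 2, 1),
    ("App-layer misconfiguration", 1, 2),
    ("Node/kernel", 3, 3),
]

# Viable (fault, target) pairs per class: faults x targets.
pairs_per_class = {name: faults * targets for name, faults, targets in CLASSES}
total_pairs = sum(pairs_per_class.values())

# Share of the pair space covered by the 90 curated problems.
CURATED_PROBLEMS = 90
coverage_percent = 100 * CURATED_PROBLEMS / total_pairs

print(f"total pairs: {total_pairs}")
print(f"coverage: {coverage_percent:.1f}%")
```
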

## Appendix D Case Studies

### D\.1 Metastable Failure

SREGym includes problems that model metastable failures, where a temporary trigger pushes the system into a self\-sustaining degraded state that persists even after the trigger is removed \[[10](https://arxiv.org/html/2605.07161#bib.bib24),[36](https://arxiv.org/html/2605.07161#bib.bib6),[38](https://arxiv.org/html/2605.07161#bib.bib25)\]\. Each problem is a *compound fault*: an application\-level trigger \(e\.g\., a misconfigured retry policy that amplifies traffic, or a runtime flag that forces frequent garbage collection\) paired with an infrastructure constraint that drives the system into a vulnerable state \(e\.g\., a resource quota or limit that caps CPU or memory allocation\)\. The constraint produces no errors and no failed traces; it is discoverable only through explicit inspection of deployment configuration\. The trigger drives the system from the vulnerable state to the self\-sustaining metastable state: system performance degrades with no explicit fail\-stop symptoms\. Fixing the trigger alone does not mitigate the symptoms: the agent must reason about the relationship between the trigger and the sustained degraded state, then mitigate the problem by removing the infrastructure\-level constraint and restarting the application, giving it a clean slate\.

In our evaluation, agents reliably diagnosed the application\-level trigger through trace analysis and deployment inspection, but almost never discovered the infrastructure constraint because it produces no observable symptom\. In the one run where Codex *did* find the infrastructure resource constraint, it attributed the entire failure to it and dismissed the application\-level misconfiguration\. In no case did the agent identify the interaction between the trigger and the constraint\.

Agents do fix the metastable behavior at times: they correct the misconfiguration and restart the affected services, which restores the system and clears the compounded traffic so that normal execution can resume\. When the agent fails to mitigate the metastable behavior, it is often because the agent is distracted by surface\-level symptoms and attributes the failure to an entirely different fault type\. For example, in one run of the gc\_capacity\_degradation problem, where an aggressively low GOGC setting forces frequent garbage collection across all workloads in a capacity\-constrained namespace, the agent never inspected GOGC at all\. Instead, the agent fixated on distributed traces showing a ~300 ms gap between the frontend and the profile\-service, while the server\-side GetProfiles handler completed in microseconds\. From this pattern alone, the agent concluded that the delay must live in the network path, and submitted a root cause of a tc netem rule injecting ~150 ms of latency in each direction on the profile\-service pod’s veth interface\. This incorrect diagnosis makes two independent errors: \(1\) it names network latency injection as the fault type, and \(2\) it scopes the fault to a single service, whereas the actual fault affects all workloads in the namespace\. As a result, the agent proposed mitigations targeting a nonexistent network rule, and never restarted the workloads to clear the metastable garbage\-collection loop\.

### D\.2 Hardware Fault

The latent\_sector\_error problem in SREGym injects hardware faults into the storage devices of a physical node\. The affected disk returns intermittent errors on file\-reading system calls \(e\.g\., pread\(\)\)\. The MongoDB databases deployed on that node crash with “read: input/output error,” as they are unable to read storage\-engine\-related metadata files\.

Our evaluation shows that agents struggle to diagnose the hardware\-related root cause\. Across three Stratus, three Claude Code, and three Codex runs with no noise simulation, none produced a diagnosis score above 0\.22, and D2 \(fault characterization\) scored 0 in every run, showing that their diagnosis never reached the correct component of the system\. For example, the agents observe that the I/O errors come from the memcached deployment, which reads from the MongoDB deployment, and incorrectly attribute the failure to dropped connections from the memcached deployment\. Agents also blame user workloads for overwhelming the memcached process, causing it to reset connections\. Lastly, they blame the microservice application for not gracefully handling the I/O error\. These conclusions never mention that the underlying hardware can be defective\.

This problem exposes a specific weakness: agents lack an understanding of how the underlying hardware can affect the applications deployed atop it\. Agents treat the exposed I/O error as evidence of incorrect error handling, but never suggest the disk\-level diagnostics that are standard SRE responses to hardware failures\.
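One example of a disk-level check is to look for block-layer I/O errors in the node's kernel log and group them by device. The sketch below is an illustration only: it assumes the kernel log text has already been collected from the node, it matches the common Linux message shape `I/O error, dev <device>, sector <n>`, and the sample log lines are invented. Whether SREGym's injector produces such kernel messages is not stated in the text and is an assumption here.

```python
import re
from collections import Counter

# Matches Linux block-layer messages such as
# "blk_update_request: I/O error, dev sdb, sector 2048 ...".
IO_ERROR = re.compile(r"I/O error, dev ([\w-]+), sector (\d+)")


def disk_io_errors(kernel_log: str) -> Counter:
    """Count block-layer I/O errors per device in kernel log text."""
    counts = Counter()
    for line in kernel_log.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt for illustration.
SAMPLE_LOG = """\
[ 1021.4] blk_update_request: I/O error, dev sdb, sector 2048 op 0x0:(READ)
[ 1021.9] unrelated informational message
[ 1033.0] blk_update_request: I/O error, dev sdb, sector 4096 op 0x0:(READ)
"""

print(disk_io_errors(SAMPLE_LOG))
```

Errors that cluster on one device point below the application layer, which is the hypothesis the agents never raised.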

### D\.3 Greedy Approach in Diagnosing Failures

A recurring agent failure pattern across our evaluation is a greedy approach in which the agent diagnoses and mitigates the first anomaly it observes in the environment\. The agent first encounters a suspicious but unrelated anomaly and submits a premature diagnosis, which later misleads the mitigation phase\. These anomalies can be injected noise running alongside the target fault, or pre\-existing application characteristics such as low memory limits in application containers\.

We observe that the pattern is consistent regardless of the anomaly the agent picks up\. Figure [1](https://arxiv.org/html/2605.07161#S1.F1) shows an example using the missing\_env\_variable\_astronomy\_shop problem, where the actual fault is a dropped environment variable on the frontend component of the application\. Across all runs of both Stratus and Claude Code, neither agent ever inspected the frontend component\. Stratus latched onto another deployment with a low memory limit, while Claude Code focused on Jaeger tracing infrastructure\. Both found plausible\-looking issues, diagnosed them as the root cause, and never investigated the actual faulty component\. This greedy approach affects mitigation effectiveness\. After fixating on the memory limit, the agent raises it with kubectl patch or a similar edit, but this fix does not address the offending fault\.

A more striking case occurs when the agent does encounter evidence of the offending fault but still anchors on its initial hypothesis\. In the valkey\_auth\_disruption problem, the fault is an invalid password set on the Valkey database at runtime\. The dependent microservice component crashes because it cannot authenticate to the database\. Agents often attributed the crash to insufficient memory, noting that the containers had low memory limits\. However, the agents also observed that the init container’s TCP health check succeeded while the application\-layer database connection failed\. This discrepancy is consistent with authentication failure, not resource exhaustion\. In one run, the agent mentioned the password\-setting command \(requirepass\) multiple times as a candidate hypothesis but defaulted to the memory explanation\.

This case study exposes a specific weakness: the agent treats the first plausible abnormality as a stopping criterion\. Once a candidate root cause is written down, the model interprets subsequent evidence in the mitigation phase as supporting it and does not generate competing hypotheses\. A human SRE, in contrast, would form multiple hypotheses and investigate them concurrently until the real root cause is found\.

## Appendix E Tool Usage Details

We provide a detailed analysis of the tool usage reported in §[3\.4](https://arxiv.org/html/2605.07161#S3.SS4)\. We classify every tool call in the agent trajectories into the following categories:

- kubectl \(read\) for read\-only inspection of system state \(e\.g\., get, logs, describe, top, auth\)\.
- kubectl \(write\) for changing system state \(e\.g\., patch, rollout, set, run, delete, apply, scale, port\-forward\)\.
- Observability for querying metrics, traces, and service maps\. Stratus uses dedicated agent tools; Claude Code and Codex use curl to query Prometheus, Jaeger, or Loki endpoints\.
- Network for connectivity probes \(e\.g\., curl to application endpoints, ss, netstat, nc, ping\)\.
- Wait/sleep for deliberate pausing \(e\.g\., wait\_tool for Stratus and sleep for others\)\.
- Shell for general\-purpose utilities \(e\.g\., cat, ls, grep, python, sed, export\)\.
- Others, such as agent\-framework\-native tools \(e\.g\., Claude Code’s Read, Agent, WebSearch; Kimi’s thinking tool\)\.
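The categories above can be turned into a simple rule-based classifier over command strings. The sketch below is an assumption about how such a classification could be implemented, not the authors' pipeline: it treats `exec` as a read and `create` as a write because Tables 10 and 11 list them that way, and it separates observability `curl` calls from network probes by a hypothetical hostname substring match.

```python
import shlex

KUBECTL_READ = {"get", "logs", "describe", "top", "auth", "exec"}
KUBECTL_WRITE = {"patch", "rollout", "set", "run", "delete", "apply",
                 "scale", "port-forward", "create"}
OBSERVABILITY_HOSTS = ("prometheus", "jaeger", "loki")  # assumed hostname substrings
NETWORK_TOOLS = {"ss", "netstat", "nc", "ping"}
WAIT_TOOLS = {"sleep", "wait_tool"}
SHELL_TOOLS = {"cat", "ls", "grep", "python", "sed", "export"}


def classify(command: str) -> str:
    """Map one tool call, given as a command string, to a category name."""
    tokens = shlex.split(command)
    if not tokens:
        return "other"
    tool = tokens[0]
    if tool == "kubectl":
        # The first token naming a known subcommand decides read vs. write,
        # so leading flags such as "-n <namespace>" are skipped.
        for token in tokens[1:]:
            if token in KUBECTL_READ:
                return "kubectl (read)"
            if token in KUBECTL_WRITE:
                return "kubectl (write)"
        return "other"
    if tool == "curl":
        # curl is observability only when it targets a telemetry backend.
        targets_backend = any(host in token.lower()
                              for token in tokens[1:]
                              for host in OBSERVABILITY_HOSTS)
        return "observability" if targets_backend else "network"
    if tool in NETWORK_TOOLS:
        return "network"
    if tool in WAIT_TOOLS:
        return "wait"
    if tool in SHELL_TOOLS:
        return "shell"
    return "other"


print(classify("kubectl -n shop rollout restart deployment/frontend"))
```

A classifier of this kind is only as good as its rules; for example, a namespace that happens to be named after a subcommand would be misread, so real trajectories would need more careful argument parsing.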

Submission calls are excluded from all counts\. Table [8](https://arxiv.org/html/2605.07161#A5.T8) shows the tool call category breakdown per evaluation configuration\. Table [9](https://arxiv.org/html/2605.07161#A5.T9) shows the ratio of read to write commands in the system environment\. Table [10](https://arxiv.org/html/2605.07161#A5.T10) and Table [11](https://arxiv.org/html/2605.07161#A5.T11) show the top subcommands executed in kubectl\. All averages are computed as the mean across runs per problem, then the mean across problems\.
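The two-level averaging described above (mean across runs per problem, then mean across problems) differs from pooling all runs together when problems have different numbers of runs. A minimal sketch, with hypothetical problem names and values:

```python
from statistics import mean


def macro_average(runs_by_problem):
    """Mean across runs within each problem, then mean across problems."""
    per_problem = [mean(values) for values in runs_by_problem.values() if values]
    return mean(per_problem)


def pooled_average(runs_by_problem):
    """Single mean over all runs, shown only for contrast."""
    return mean(v for values in runs_by_problem.values() for v in values)


# Hypothetical per-run measurements for two problems.
measurements = {"problem_a": [10, 20, 30], "problem_b": [40]}
print(macro_average(measurements), pooled_average(measurements))
```

With this data the two-level mean weights both problems equally, while the pooled mean is pulled toward the problem with more runs.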

Table 8: Tool call category breakdown\.

| Agent | Model | Noise | kctl (read) | kctl (write) | Observability | Network | Wait | Shell | Other |
|---|---|---|---|---|---|---|---|---|---|
| Stratus | Sonnet 4.6 | □ | 65.8 | 11.9 | 16.6 | 0.0 | 5.6 | 0.0 | 0.0 |
| Stratus | Sonnet 4.6 | ■ | 71.8 | 8.3 | 14.5 | 0.0 | 5.1 | 0.0 | 0.2 |
| Stratus | K2.5 | □ | 71.6 | 8.5 | 14.6 | 0.0 | 4.6 | 0.0 | 0.7 |
| Stratus | K2.5 | ■ | 72.0 | 8.1 | 14.8 | 0.0 | 4.4 | 0.0 | 0.6 |
| Claude Code | Sonnet 4.6 | □ | 61.5 | 4.6 | 2.4 | 8.3 | 4.2 | 16.5 | 2.5 |
| Claude Code | Sonnet 4.6 | ■ | 64.8 | 4.5 | 2.1 | 8.5 | 2.8 | 14.0 | 3.4 |
| Codex | GPT-5.4 | □ | 60.2 | 7.7 | 2.4 | 8.7 | 2.7 | 18.4 | 0.1 |
| Codex | GPT-5.4 | ■ | 61.6 | 7.6 | 2.7 | 7.4 | 2.7 | 17.8 | 0.2 |

Table 9: kubectl read/write analysis\. Ratio is total reads divided by total writes\. “Reads → First Write” is the mean number of read\-only kubectl commands before the first mutation of system state, averaged per problem then across problems\.

| Agent | Model | Noise | Ratio | Reads → First Write |
|---|---|---|---|---|
| Stratus | Sonnet 4.6 | □ | 5.5:1 | 22.7 |
| Stratus | Sonnet 4.6 | ■ | 8.6:1 | 24.2 |
| Stratus | K2.5 | □ | 8.4:1 | 24.8 |
| Stratus | K2.5 | ■ | 8.9:1 | 27.6 |
| Claude Code | Sonnet 4.6 | □ | 13.5:1 | 21.3 |
| Claude Code | Sonnet 4.6 | ■ | 14.5:1 | 23.9 |
| Codex | GPT-5.4 | □ | 7.9:1 | 18.9 |
| Codex | GPT-5.4 | ■ | 8.1:1 | 21.2 |

Table 10: Top\-3 kubectl read subcommands per agent \(% of all read\-only kubectl calls\)\.

| Agent | Model | Noise | 1st | 2nd | 3rd |
|---|---|---|---|---|---|
| Stratus | Sonnet 4.6 | □ | get (64) | logs (18) | describe (13) |
| Stratus | Sonnet 4.6 | ■ | get (65) | logs (17) | describe (14) |
| Stratus | K2.5 | □ | get (69) | logs (20) | describe (9) |
| Stratus | K2.5 | ■ | get (67) | logs (21) | describe (10) |
| Claude Code | Sonnet 4.6 | □ | get (59) | logs (23) | describe (10) |
| Claude Code | Sonnet 4.6 | ■ | get (62) | logs (19) | describe (11) |
| Codex | GPT-5.4 | □ | get (60) | logs (24) | exec (7) |
| Codex | GPT-5.4 | ■ | get (57) | logs (25) | exec (8) |

Table 11: Top\-5 kubectl write subcommands per agent \(% of all mutating kubectl calls\)\.

| Agent | Model | Noise | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|---|---|
| Stratus | Sonnet 4.6 | □ | patch (39) | set (19) | run (14) | delete (9) | rollout (8) |
| Stratus | Sonnet 4.6 | ■ | patch (41) | set (23) | rollout (9) | delete (7) | create (6) |
| Stratus | K2.5 | □ | patch (41) | set (15) | create (13) | rollout (11) | delete (6) |
| Stratus | K2.5 | ■ | patch (40) | set (13) | delete (11) | create (10) | rollout (9) |
| Claude Code | Sonnet 4.6 | □ | rollout (35) | patch (32) | run (9) | delete (7) | apply (5) |
| Claude Code | Sonnet 4.6 | ■ | rollout (40) | patch (29) | delete (11) | run (7) | apply (4) |
| Codex | GPT-5.4 | □ | rollout (33) | patch (20) | run (19) | port-fwd (12) | delete (5) |
| Codex | GPT-5.4 | ■ | rollout (29) | run (21) | patch (17) | port-fwd (11) | delete (11) |

Table 12: Mitigation retry analysis \(Stratus variants only\)\. The number of total mitigation attempts and mitigation reads are conditioned on whether the initial diagnosis was correct \(D\) or incorrect \(¬D\)\.

| Agent | Model | Noise | Attempts (D) | Attempts (¬D) | Mitig. reads (D) | Mitig. reads (¬D) |
|---|---|---|---|---|---|---|
| Stratus | Sonnet 4.6 | □ | 1.88 | 3.82 | 17.3 | 66.2 |
| Stratus | Sonnet 4.6 | ■ | 1.55 | 1.61 | 13.2 | 21.6 |
| Stratus | K2.5 | □ | 1.79 | 1.90 | 19.3 | 26.4 |
| Stratus | K2.5 | ■ | 1.56 | 1.51 | 16.4 | 21.2 |

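The two Table 9 metrics can be sketched from a per-run sequence of classified kubectl calls. This is a minimal illustration under stated assumptions: each run is a list of labels (`"read"`, `"write"`, or anything else for non-kubectl calls), and runs that never write are reported as `None` for the first metric. How the authors handle such runs is not stated, so that choice is hypothetical.

```python
def reads_before_first_write(calls):
    """Number of read-only kubectl calls before the first mutating call.

    Returns None when the run never mutates system state (an assumption;
    the text does not say how such runs are treated).
    """
    reads = 0
    for call in calls:
        if call == "write":
            return reads
        if call == "read":
            reads += 1
    return None


def read_write_ratio(runs):
    """Total reads divided by total writes across runs."""
    total_reads = sum(run.count("read") for run in runs)
    total_writes = sum(run.count("write") for run in runs)
    return total_reads / total_writes if total_writes else float("inf")


# Hypothetical trajectories for illustration.
runs = [
    ["read", "read", "write", "read"],
    ["read", "other", "read", "read", "write", "write"],
]
print([reads_before_first_write(run) for run in runs], read_write_ratio(runs))
```

The ratio is computed over totals rather than per run, following the Table 9 caption, whereas the reads-before-first-write values would then be averaged per problem and across problems.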
## Appendix F Analysis of Total Tokens and End\-to\-End Success Rate

Figure [6](https://arxiv.org/html/2605.07161#A6.F6) plots mean total tokens per run against end\-to\-end success rate for each \(agent, model, noise\) configuration\. The plot separates two regimes\. Stratus consumes 0\.53M tokens per run on average \(range 0\.41–0\.81M\), while Claude Code uses 1\.59M \(3\.0×\) and Codex uses 1\.93M \(3\.6×\), consistent with Stratus’s multi\-agent architecture that preprocesses observability data and prompts the underlying LLM with only the filtered subset\. The coding agents instead pull raw observability streams into the LLM context, which inflates token usage without a corresponding increase in end\-to\-end success\.

Token spend does not predict success in this set\. The highest end\-to\-end rate is achieved by Claude Code in the environment with no noise injected, but Stratus configurations land within a few points of it while spending a small fraction of the tokens\. Codex spends the most tokens of any configuration and trails Claude Code in end\-to\-end success, indicating that on SREGym the marginal token does not buy additional success\.

![Refer to caption](https://arxiv.org/html/2605.07161v1/x44.png)
Figure 6: Mean total tokens per run versus end\-to\-end success rate \(P\(D∧M\)\) for each \(agent, model, noise\) configuration\. Circles denote the clean condition; crosses denote the noisy condition\. Stratus configurations occupy a low\-token / mid\-success regime, while Claude Code and Codex occupy a high\-token regime without a corresponding gain in end\-to\-end success\.

Noise reduces end\-to\-end success rate across every agent \(the noise marker for each color sits below its clean counterpart\), but the token cost of noise is small\. The coding agents do spend somewhat more tokens under noise as they ingest additional anomaly signals, while Stratus’s token usage is essentially noise\-invariant because its preprocessing layer absorbs the extra signals before they reach the LLM\.

## Appendix G Broader Impacts

SREGym is a benchmark for evaluating AI agents on Site Reliability Engineering \(SRE\)\. We discuss the positive and negative societal impacts that we foresee from the work, along with mitigations\.

Positive impacts\. The most direct positive impact is improving the reliability of AI\-assisted computing system operations\. Production failures are a leading cause of service outages, financial loss, and engineer/operator burnout\. A growing number of AI products are being marketed for autonomous SRE \[[64](https://arxiv.org/html/2605.07161#bib.bib56), [19](https://arxiv.org/html/2605.07161#bib.bib95), [73](https://arxiv.org/html/2605.07161#bib.bib123), [7](https://arxiv.org/html/2605.07161#bib.bib90), [6](https://arxiv.org/html/2605.07161#bib.bib91)\]\. Without realistic, mitigation\-oriented benchmarks, claims about these technologies and products are difficult to verify\. SREGym is designed to evaluate AI agents’ capability towards autonomous SRE\. It injects faults across the full system stack \(application, platform, OS, hardware\), composes them with concurrent noises to model real\-world production environments, and verifies mitigation against system state\. By making it easier to evaluate SRE agents on realistic, high\-fidelity SRE problems, we aim to raise the empirical bar that AI/agentic SRE products must clear, and to reduce the chance that under\-tested agents are granted production access\. A secondary positive impact is an open infrastructure for research\. The composable APIs, fault and noise injectors, and MCP\-based agent interface are released so that other researchers can extend SREGym to new applications, new fault classes, and new agent architectures without rebuilding the scaffolding\. In fact, we have heard from a number of colleagues who are using SREGym in their research projects\.

Negative impacts and mitigations\. We considered three categories of potential negative impact\. First, automation displacement: capable AI SRE agents could reduce demand for entry\-level operation work, much like AI coding assistants are reshaping software development jobs\. Our results, however, suggest this risk is currently distant: even frontier agents struggle on SREGym’s realistic failure scenarios \(see §[3](https://arxiv.org/html/2605.07161#S3) and §[D](https://arxiv.org/html/2605.07161#A4)\)\. Second, misuse of fault\-injection mechanisms: SREGym’s injectors could in principle be repurposed to cause harm in production systems that the user controls\. We note that all of the injection primitives we use are standard, widely available tools; therefore, SREGym does not introduce a novel attack capability beyond what these tools already expose\. The benchmark requires legitimate cluster credentials to operate, and mounting a real attack with these primitives is no easier than using the underlying tools directly\. Third, over\-reliance on benchmark numbers: a high SREGym score is necessary but not sufficient evidence that an agent is safe to grant production access\. Section [H](https://arxiv.org/html/2605.07161#A8) explicitly enumerates the assumptions and scope under which our scores should be interpreted, including applications smaller than production scale, a simplified noise model, and a small set of evaluated agent\-model pairs\. We encourage our users to read SREGym results as a lower bound on the failure modes an agent can recover from, not an upper bound on the failure modes it will encounter in real\-world production deployments\.

## Appendix H Limitations

SREGym has limitations that shape how its results should be interpreted and extended\.

Oracle Variance\. The diagnosis oracle relies on an LLM evaluator, which introduces a source of variance that purely programmatic checks do not have\. We mitigate this by constraining the LLM evaluator to a structured rubric with per\-question answers \(Appendix [A](https://arxiv.org/html/2605.07161#A1)\) and by reporting per\-dimension breakdowns so that readers can see where judgments are stable versus noisy\. Despite its strong empirical reliability \(see Table [2](https://arxiv.org/html/2605.07161#S2.T2)\), the oracle remains a best\-available approximation for evaluating natural\-language diagnosis results at scale\.

Noise Modeling\. The noise injected by SREGym is a simplification of real\-world production disturbances\. We inject transient pod crashes and resource stress on schedules, which captures one common pattern \(routine pod churn in a busy system\) but does not model all sources of production noise, such as high\-variance traffic anomalies \[[54](https://arxiv.org/html/2605.07161#bib.bib18)\], partial network partitions \[[3](https://arxiv.org/html/2605.07161#bib.bib81)\], or slow performance degradation caused by gradual resource exhaustion \[[17](https://arxiv.org/html/2605.07161#bib.bib7)\]\. Agents that perform well on SREGym’s noise model may still be surprised by noise patterns we do not cover yet\.

System Scale\. SREGym’s deployed applications are substantially smaller than production systems at major cloud providers\. Our largest application, Train Ticket \[[93](https://arxiv.org/html/2605.07161#bib.bib119)\], has 40 microservices, whereas production deployments at companies such as Uber, Netflix, and Meta commonly run thousands of interacting services \[[54](https://arxiv.org/html/2605.07161#bib.bib18), [27](https://arxiv.org/html/2605.07161#bib.bib43)\]\. Scale affects both the diagnosis \(agents face fewer candidate services to investigate\) and the mitigation evaluation \(fewer cross\-service dependencies to reason about\)\. Results on SREGym may not scale linearly to systems that are orders of magnitude larger\.

Environment Scope\. SREGym targets cloud\-native, Kubernetes\-based deployments, which are the dominant platform for modern production systems but not the only one\. Workloads running on monolithic deployments or edge deployments have different failure modes that SREGym does not currently exercise\. Extending to these environments would require new fault injectors and observability integrations, which the composable architecture is designed to support\.

Agent Coverage in the Evaluation\. Constrained by our budget, we could only cover three agents \(Stratus, Claude Code, Codex\) paired with three frontier models \(Claude Sonnet 4\.6, Kimi K2\.5, GPT\-5\.4\) in our evaluation\. This is a small sample relative to the space of possible agent architectures and LLM backbones, and our conclusions about general\-purpose coding agents versus specialized SRE agents are correspondingly tentative\. We release the benchmark and scoring pipeline so that the community can evaluate additional agent\-model combinations and report results under the same oracles\. We are committed to supporting such efforts and maintaining the leaderboard\.

## Appendix I Per\-Problem End\-to\-End Results

Figures [7](https://arxiv.org/html/2605.07161#A9.F7) and [8](https://arxiv.org/html/2605.07161#A9.F8) show the per\-problem end\-to\-end \(E2E\) success rate for each agent without and with noise injected into the environment, respectively\. Problems \(rows\) are sorted by mean end\-to\-end rate across agents; agents \(columns\) are sorted by overall end\-to\-end rate\. A cell value of k/3 indicates that k out of three runs achieved both correct diagnosis and successful mitigation\.
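The cell values and row ordering described above can be sketched as follows. The input layout (a nested dictionary of per-run `(diagnosed, mitigated)` booleans), the problem and agent names, and the descending sort direction are assumptions for illustration; the text does not state whether rows are sorted in ascending or descending order.

```python
from statistics import mean


def e2e_successes(runs):
    """Count runs with both a correct diagnosis and a successful mitigation."""
    return sum(1 for diagnosed, mitigated in runs if diagnosed and mitigated)


def build_grid(results):
    """Return {problem: {agent: "k/n"}} with problems sorted by mean E2E rate."""
    def problem_rate(problem):
        rates = [e2e_successes(runs) / len(runs) for runs in results[problem].values()]
        return mean(rates)

    ordered = sorted(results, key=problem_rate, reverse=True)  # assumed direction
    return {
        problem: {agent: f"{e2e_successes(runs)}/{len(runs)}"
                  for agent, runs in results[problem].items()}
        for problem in ordered
    }


# Hypothetical outcomes: each tuple is (diagnosis correct, mitigation successful).
results = {
    "problem_a": {"agent_x": [(True, False), (True, True), (False, True)],
                  "agent_y": [(False, False), (False, False), (True, False)]},
    "problem_b": {"agent_x": [(True, True), (True, True), (True, True)],
                  "agent_y": [(True, True), (False, True), (True, True)]},
}
print(build_grid(results))
```

Note that a run counts toward k only when both conditions hold, so a run with a successful mitigation but an incorrect diagnosis does not contribute.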

![Refer to caption](https://arxiv.org/html/2605.07161v1/x45.png)
Figure 7: Per\-problem end\-to\-end success rate with no noise injected\.

![Refer to caption](https://arxiv.org/html/2605.07161v1/x46.png)
Figure 8: Per\-problem end\-to\-end success rate under noisy conditions\.