Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

arXiv cs.AI 05/18/26, 04:00 AM Papers
Summary
A controlled study of compound LLM agent design in an adversarial POMDP (CybORG CAGE-2), systematically varying context, reasoning, and hierarchy across five model families. Key findings: programmatic state abstraction yields large returns per token, hierarchy without deliberation tools achieves best absolute performance, and context engineering is more cost-effective than deeper reasoning.
arXiv:2605.16205v1
# Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP
Source: [https://arxiv.org/html/2605.16205](https://arxiv.org/html/2605.16205)
\(2026\)

###### Abstract\.

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several interacting design dimensions: \(1\) what the agent sees, \(2\) how it reasons, and \(3\) how tasks are decomposed across components\. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs\. We present a controlled study of compound LLM agent design in CybORG CAGE\-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process \(POMDP\)\. Reward is non\-positive, so all configurations operate in a failure\-mitigation mode and errors compound over time\. Our evaluation spans five model families, six models, and twelve configurations \(3,475 episodes\) with token\-level cost accounting\. We systematically vary context representation \(raw observations vs\. a deterministic, programmatic environment state\-tracking layer with compressed history\), deliberation \(self\-questioning, self\-critique, and self\-improvement tools, with optional chain\-of\-thought prompting\), and hierarchical decomposition \(monolithic ReAct vs\. delegation to specialized sub\-agents\)\. We find that: \(1\) Programmatic state abstraction delivers the largest returns per token spent \(RPTS\), improving mean return by up to 76% over raw observations\. \(2\) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3\.4$\times$ worse mean return while using 1\.8\-2\.7$\times$ more tokens\. We call this destructive interaction pattern a *deliberation cascade*\. \(3\) Hierarchical decomposition without deliberation tools achieves the best absolute performance for most models, and context engineering is generally more cost\-effective than deliberation\.
These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per\-agent reasoning, as these strategies can interfere when combined\.

compound AI systems, LLM agents, hierarchical agent architectures, tool\-augmented language models, context engineering, inference\-time scaling, token efficiency, cost–performance trade\-offs, adversarial POMDP, autonomous cyber defense, catastrophic failures

ACM Conference on AI and Agentic Systems \(ACM CAIS ’26\), May 26–29, 2026, San Jose, CA, USA\. Copyright: rights retained\. DOI: 10\.1145/3786335\.3813149\. ISBN: 979\-8\-4007\-2415\-2/26/05\.

CCS concepts: Computing methodologies → Artificial intelligence; Computing methodologies → Partially\-observable Markov decision processes; Computing methodologies → Multi\-agent systems; Computing methodologies → Intelligent agents; Security and privacy → Network security\.

## 1\. Introduction

Practitioners build compound LLM agents by composing three design dimensions: *context engineering* \(Karpathy, [2025](https://arxiv.org/html/2605.16205#bib.bib8)\) \(what the agent sees\), *deliberation* \(reasoning depth\), and *hierarchical decomposition* \(task distribution\)\. While often assumed to be additive, these choices can interact destructively in adversarial, partially observable sequential environments\. We present a controlled empirical study of these interactions within a compound LLM agent defending a network in the CybORG CAGE\-2 POMDP\. By systematically ablating context, reasoning, and hierarchy across five model families and six models, we measure both task performance and token cost, revealing that *what the agent sees* is a more reliable lever than *how deeply it thinks*: deterministic programmatic context yields large gains at near\-zero marginal cost, whereas distributing deliberation across a hierarchy often degrades performance while inflating token consumption\. We term this failure mode a *deliberation cascade*\. While bounded hierarchy often achieves the best *absolute* return, context engineering delivers the best *returns per token*, making it the most cost\-effective first investment\.

##### The Empirical Gap

Three gaps motivate this work\. \(1\) Multi\-agent research emphasizes *topology* \(wiring\) over internal agent design \(Kim et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib10)\), leaving the interaction between internal configuration and hierarchy underexplored\. \(2\) While context engineering is widely endorsed \(Karpathy, [2025](https://arxiv.org/html/2605.16205#bib.bib8)\), controlled evidence on the cost\-performance trade\-offs of specific context components in sequential settings is limited\. \(3\) The deliberation and multi\-agent literatures remain disconnected, and we show that combining them can introduce failure modes invisible to either in isolation\.

##### Research Question

Our central question is: which compound\-agent design dimensions deliver performance gains per token invested, and how do they interact when composed across a multi\-agent hierarchy? We decompose this into: RQ1 \(Context\): value of programmatic abstraction vs\. raw observations; RQ2 \(Reasoning\): impact of deliberation tools in monolithic vs\. hierarchical agents; and RQ3 \(Composition\): when decomposition helps or hurts\.

##### Contributions

We present four contributions at the intersection of architectural design and system optimization\. \(1\) Cost\-effective context engineering\. We show that a deterministic state\-tracking layer reduces cumulative penalty by 52\-76% relative to raw observations for four of six models, dominating raw\-observation context configurations at near\-zero marginal cost\. \(2\) Identification of deliberation cascades\. We demonstrate that enabling deliberation tools across a hierarchy degrades performance in all six models \(up to 3\.4$\times$ worse return\) while using 1\.8\-2\.7$\times$ more tokens, producing cascading uncertainty\. \(3\) Three\-axis Pareto analysis\. We conduct a controlled ablation of context, reasoning, and hierarchy across five model families \(72 model\-configuration pairs, 3,475 episodes\), constructing cost\-performance frontiers that consistently place programmatic context on the efficient frontier\. \(4\) Multi\-model validation\. We show that while qualitative effects \(context helps, distributed reasoning hurts\) are robust, quantitative magnitudes vary, validating multi\-model evaluation as essential for compound AI design\.

##### Scope

This paper is a static architectural design\-space study: it asks what compound\-agent architecture to build at deployment time, before any runtime adaptation\. We scope our claims to structured adversarial POMDPs\.

## 2\. Background & Problem Definition

We study compound LLM agents operating in an adversarial, partially observable sequential decision problem and evaluate designs jointly on \(1\) task return and \(2\) token cost\. The agent architecture and the multi\-dimensional ablation study mirror our engineering ladder for solving CAGE\-2 with an LLM agent: we began with raw environment observations, then introduced deterministic state tracking and context engineering to make the observations actionable, decomposed the task into a hierarchy as context grew, and finally added deliberation tools to facilitate better decision\-making within each agent\. This section defines the environment and objectives, specifies the execution scaffold and initialization scope, and establishes tokens as the cost primitive\.

##### CybORG CAGE\-2

We evaluate on CybORG CAGE\-2 \(TTCP CAGE Challenge Working Group, [2022](https://arxiv.org/html/2605.16205#bib.bib4); Standen et al\., [2021](https://arxiv.org/html/2605.16205#bib.bib16)\), an adversarial POMDP modelling network defense\. A defender protects a 13\-host network against an attacker that follows a scripted, non\-adaptive multi\-stage kill chain \(Kiely et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib9)\) over $T=30$ steps\. The attacker does not respond to the defender’s actions, but host attributes, processes, and the progression of the kill chain vary stochastically across runs\. The defender chooses from five actions \(Monitor, Analyse, Remove, Restore, Decoy\) with asymmetric costs\. Reward $r_t \leq 0$ penalizes compromise and intervention; we report episodic return $G=\sum_{t=1}^{T} r_t$ \(closer to zero is better\)\. Partial observability requires sustaining situational awareness under noisy indicators\. \(See Appendix [H](https://arxiv.org/html/2605.16205#A8) for full details\.\)

##### Configuration\-Driven ReAct Scaffold

Agents follow a ReAct \(Yao et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib17)\) loop, iterating between deliberation and tool use until emitting an answer\. We separate a reusable ReAct Agent engine \(I/O, parsing\) from declarative YAML "personality" configurations \(prompts, tools\)\. This separation ensures reproducibility and allows architectural variants to be defined as configuration changes rather than code modifications\.
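The engine/personality split can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Personality` fields, the `ANSWER:`/`TOOL:` reply convention, and the scripted stub model are hypothetical, and a plain dataclass stands in for the YAML file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class Personality:
    """Declarative part of an agent (stand-in for a YAML file; hypothetical schema)."""
    name: str
    system_prompt: str
    tools: Tuple[str, ...] = ()
    max_turns: int = 5


def run_react(personality: Personality,
              llm: Callable[[str], str],
              tool_impls: Dict[str, Callable[[], str]],
              user_prompt: str) -> Optional[str]:
    """Reusable engine: alternate model turns and tool calls until an answer appears."""
    transcript = [personality.system_prompt, user_prompt]
    for _ in range(personality.max_turns):
        reply = llm("\n".join(transcript)).strip()
        transcript.append(reply)
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("TOOL:"):
            tool = reply[len("TOOL:"):].strip()
            # Only tools enabled in the personality may run.
            if tool in personality.tools and tool in tool_impls:
                transcript.append("RESULT: " + tool_impls[tool]())
            else:
                transcript.append("RESULT: tool not available")
    return None  # no answer within the turn budget


def scripted_llm(replies):
    """Stub model that replays fixed replies (illustration only)."""
    queue = list(replies)
    return lambda _prompt: queue.pop(0)


# Two architectural variants that differ only in configuration.
base = Personality("planner-base", "You defend a network.")
delib = Personality("planner-question", "You defend a network.", tools=("question",))
```

In this sketch the two variants share `run_react` unchanged; only the declarative object differs, which is the property the scaffold relies on.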

##### Knowledge\-Free Initialization

To isolate architectural effects from domain engineering, we impose a knowledge\-free scope at $t=0$\. Agents receive only a one\-sentence role instruction and a compact action reference table\. No network topology, attacker kill\-chain details, host\-value assignments, or defense heuristics are provided, and the prompt never mentions the environment by name\. Performance gains arise from model internal knowledge, runtime context, and deliberation\. Models’ internal knowledge may include information about CAGE\-2 acquired during pre\-training\. To minimize the consequences of this exposure, no benchmark\-specific cues are provided at runtime\. However, influence from pre\-training may still be present\. All LLMs receive an identical set of prompts and tools\.

##### Tokens as Cost Primitive

We use the total number of tokens per episode \(prompt \+ completion\) as the primary cost metric, mapping directly to billed usage and correlating with latency\. We use raw token counts because they do not depend on provider pricing\. Appendix [F](https://arxiv.org/html/2605.16205#A6) reports prompt/completion splits by model and configuration for pricing\-weighted reinterpretation\. We instrument all LLM calls, aggregating prompt and completion tokens across the Planner and sub\-agents\. Pairing return with token cost enables our central evaluative question: which design choices deliver the largest returns per token spent \(RPTS\), and which merely inflate the cost\.
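A minimal sketch of per-episode token accounting along these lines follows. The `TokenLedger` class and its method names are hypothetical, and the token counts in the usage lines are made-up values, not measurements from the study.

```python
from collections import defaultdict


class TokenLedger:
    """Accumulates prompt and completion tokens per agent for one episode."""

    def __init__(self) -> None:
        self._usage = defaultdict(lambda: {"prompt": 0, "completion": 0})

    def record(self, agent: str, prompt_tokens: int, completion_tokens: int) -> None:
        """Call once per LLM request, including retries and repair prompts."""
        self._usage[agent]["prompt"] += prompt_tokens
        self._usage[agent]["completion"] += completion_tokens

    def split(self) -> dict:
        """Prompt/completion totals across all agents."""
        return {
            kind: sum(usage[kind] for usage in self._usage.values())
            for kind in ("prompt", "completion")
        }

    def kilotokens(self) -> float:
        """Total tokens (prompt + completion) in thousands."""
        return sum(self.split().values()) / 1000


# Illustrative (made-up) counts for two calls in one episode.
ledger = TokenLedger()
ledger.record("Planner", prompt_tokens=1200, completion_tokens=300)
ledger.record("Analyst", prompt_tokens=400, completion_tokens=100)
```

Keeping the prompt/completion split alongside the total allows the counts to be re-weighted later by any provider's pricing.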

Table 1\.System Modularity and Configuration Layers\. The agent architecture is built on four functional pillars, ordered from the system’s top\-level decision structure down to its internal reasoning mechanisms\.

## 3\. Compound Agent System Design

Our system couples a deterministic backbone with an LLM decision engine and spans four layers \(Table [1](https://arxiv.org/html/2605.16205#S2.T1)\)\. \(1\) The hierarchy layer allows a Planner to delegate tasks to Analyst and ActionChooser sub\-agents via strict JSON contracts\. \(2\) A deterministic infrastructure layer maintains a primitive environment model and a history of interactions, and validates actions without model calls\. \(3\) The context engineering layer comprises injections that extend user prompts and connect environment model state and action history by converting them into structured summaries like \{network\_status\}\. \(4\) The reasoning layer follows the ReAct pattern \(Yao et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib17)\), optionally executing deliberation within the ReAct loop\.

##### Decision Cycle

At each of the $T=30$ steps, an Agent Coordinator receives the raw CybORG observation, updates the deterministic environment model, and inserts the chosen context bundle into the Planner prompt\. It then starts a fresh step\-level Planner instance, validates the emitted action, retries on invalid outputs, and submits the validated action to the environment \(Figure [1](https://arxiv.org/html/2605.16205#S3.F1)\)\. Step\-level instantiation ensures no hidden conversational state accumulates across steps\. All inter\-step continuity is explicit in the deterministic state structures and compiled context\.

##### Reliability Mechanisms

In an adversarial sequential environment, every invalid action is a wasted step during which the attacker advances unopposed\. We therefore treat reliability as load\-bearing infrastructure implementing the following: \(1\) Action validation and retry: the Planner’s output is parsed against the CybORG action schema using regex\-based validation; invalid outputs trigger up to three retries with the parsing error injected as feedback\. \(2\) Safe fallback: if all retries fail, the system defaults to Monitor, preserving observability without risking a misapplied intervention\. \(3\) Sub\-agent output validation: the ActionChooser’s JSON undergoes tiered parsing \(direct parse → normalization → optional repair prompt\)\. Any repair calls are included in token accounting\. These mechanisms ensure format compliance and are designed to minimize harm when invoked\.
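The validate, retry, and fall-back sequence can be sketched as below. The action string format (`Monitor`, or a verb followed by a hostname) and the function names are assumptions for illustration; the paper's actual regex and schema are not given in this section.

```python
import re
from typing import Callable, Optional

# Assumed output format: "Monitor" or "<Verb> <hostname>".
ACTION_RE = re.compile(r"Monitor|(?:Analyse|Remove|Restore|Decoy)\s+[\w.-]+")


def parse_action(text: str) -> str:
    """Return the whitespace-normalized action, or raise ValueError with the problem."""
    candidate = " ".join(text.split())
    if ACTION_RE.fullmatch(candidate) is None:
        raise ValueError(f"could not parse action from {text!r}")
    return candidate


def choose_action(planner: Callable[[Optional[str]], str], max_retries: int = 3) -> str:
    """One initial attempt plus up to `max_retries` retries, then fall back to Monitor."""
    feedback = None
    for _ in range(1 + max_retries):
        try:
            return parse_action(planner(feedback))
        except ValueError as err:
            feedback = str(err)  # parsing error is passed to the next attempt
    return "Monitor"  # safe fallback: keeps observing, applies no intervention
```

Here `planner` is any callable that takes the previous parsing error (or `None`) and returns raw model text, so the same wrapper works for a real model call or a test stub.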

### 3\.1\. Layer 1: Hierarchical Decomposition

The system can run either as a *monolithic Planner* that directly emits an environment action, or as a fixed three\-agent hierarchy consistent with the centralized multi\-agent system \(MAS\) topology described by Kim et al\. \([2025](https://arxiv.org/html/2605.16205#bib.bib10)\), in which an orchestrator coordinates bounded sub\-agents through structured communication\. The three\-agent split separates three functions: strategic decision\-making, localized perception, and bounded action candidate ranking to accommodate increasing context size\. The Planner retains sole authority over environment actions but may delegate to two sub\-agents whose outputs are advisory\. The Analyst assesses a single host by comparing its current state against the effective baseline and returns a structured JSON assessment \(status, anomalies, suspected compromise\)\. The ActionChooser receives the Planner’s situational summary and, when available, the Analyst’s report, returning a ranked list of up to three candidate actions with confidence labels in strict JSON\. Sub\-agents cannot issue environment actions\. The purpose of this split is to simplify the Planner’s decision by constraining what each sub\-agent produces\. All three agents share the same engine\-personality separation logic, ensuring the observed hierarchy effects reflect role decomposition\.
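A strict JSON contract for the ActionChooser might be checked as in the following sketch. The field names (`candidates`, `action`, `confidence`) and the confidence label set are hypothetical; only the "ranked list of up to three candidates with confidence labels" shape comes from the text. The two tiers shown are direct parsing and a simple normalization; a `None` result is where a caller could issue the optional repair prompt.

```python
import json
from typing import List, Optional

ALLOWED_CONFIDENCE = {"low", "medium", "high"}  # assumed label set


def _extract(text: str) -> Optional[List[dict]]:
    """Parse and validate one candidate string; None means the contract is not met."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    candidates = payload.get("candidates")
    if not isinstance(candidates, list) or not 1 <= len(candidates) <= 3:
        return None
    for item in candidates:
        if not isinstance(item, dict):
            return None
        if not isinstance(item.get("action"), str):
            return None
        if item.get("confidence") not in ALLOWED_CONFIDENCE:
            return None
    return candidates


def _normalize(text: str) -> str:
    """Keep only the outermost {...} span, dropping surrounding chatter."""
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if 0 <= start < end else text


def parse_action_chooser(raw: str) -> Optional[List[dict]]:
    """Tier 1: direct parse. Tier 2: normalize, then parse. Otherwise None."""
    return _extract(raw) or _extract(_normalize(raw))
```

Because the output is advisory, rejecting a malformed ranking costs the Planner only a suggestion, not an environment action.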

### 3\.2\. Layer 2: Deterministic Environment Model

To provide in\-context situational awareness, the system maintains a fully deterministic environment model that transforms raw CybORG observations into a structured network\-state description and action history\. Critically, this model is computed exclusively from the agent’s own observations and past actions\. The system stores the environment *baseline state* upon the first observation and creates a host\-indexed data structure\. This layer embeds domain\-informed observation processing: the engineering choices that form the model shape what the agent perceives\. Although the layer prescribes no action\-selection decisions \(no threat rubrics, no host priorities, and no response heuristics\), it still carries inductive bias that may influence the agent’s reasoning\.

##### Dynamic Environment Model

The model data structure comprises a dictionary in which each host has a status, one of \{baseline, changed, unknown, analysed at step $n$\}, and a history record, an ordered, arrow\-delimited record of all actions applied to that host \(e\.g\., Analyse → Remove → Restore\)\. This concise single\-host record provides the Planner with additional intervention memory without requiring it to parse a full transcript\.

##### Model Update Mechanism

On each step, the model compares the current observation against the saved *baseline state* using signature\-based comparison of stable fields \(process and service identity\), ignoring volatile fields \(e\.g\., transient connections\) that produce false positives \(Kiely et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib9)\)\. The system deterministically updates host status based on the comparison and prior actions \(Restore → baseline; Remove → unknown; Analyse → analysed at step $n$\)\. Deployed decoys are registered as baseline overrides so that expected decoy processes are incorporated into the effective baseline and only genuinely new processes remain visible as anomalous\.
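The status transitions and baseline comparison can be sketched as follows. The observation layout (a `processes` list per host), the class and method names, and the precedence rule (comparison only flips `baseline` hosts to `changed`, and an action's effect then sets the status) are assumptions; the text does not specify how the two sources are ordered.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple


def signature(host_obs: dict) -> FrozenSet[str]:
    """Stable fields only; volatile keys such as connections are ignored.
    The "processes" key is a hypothetical observation layout."""
    return frozenset(host_obs.get("processes", ()))


@dataclass
class HostRecord:
    status: str = "baseline"
    actions: List[str] = field(default_factory=list)

    @property
    def history(self) -> str:
        """Arrow-delimited record of actions applied to this host."""
        return " → ".join(self.actions)


class EnvironmentModel:
    def __init__(self, first_observation: Dict[str, dict]) -> None:
        self.baseline = {h: signature(o) for h, o in first_observation.items()}
        self.hosts = {h: HostRecord() for h in first_observation}

    def register_decoy(self, host: str, process: str) -> None:
        """Baseline override: an expected decoy process is not an anomaly."""
        self.baseline[host] = self.baseline[host] | {process}

    def update(self, step: int, observation: Dict[str, dict],
               last_action: Optional[Tuple[str, str]] = None) -> None:
        # Assumption: comparison only moves hosts from "baseline" to "changed".
        for host, host_obs in observation.items():
            record = self.hosts.get(host)
            if record and record.status == "baseline" \
                    and signature(host_obs) != self.baseline[host]:
                record.status = "changed"
        if last_action is not None:
            verb, host = last_action
            record = self.hosts[host]
            record.actions.append(verb)
            if verb == "Restore":
                record.status = "baseline"
            elif verb == "Remove":
                record.status = "unknown"
            elif verb == "Analyse":
                record.status = f"analysed at step {step}"
```

No model call appears anywhere in the update path, which is what keeps this layer's marginal cost limited to the tokens of the summaries it produces.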

![Refer to caption](https://arxiv.org/html/2605.16205v1/system_v3.png) Figure 1\. End\-to\-end system architecture\. The deterministic layer \(left\) compiles structured context from CybORG observations and assembles the agent prompt\. The Planner \(right\) executes a ReAct loop, optionally delegating to Analyst and ActionChooser sub\-agents, before emitting a validated action back to the environment\. The diagram shows the deterministic infrastructure on the left \(ReAct template, knowledge, prompt construction, network status and history\) feeding into the Planner on the right, which contains deliberation tools and optional sub\-agents \(ActionChooser, Analyst\)\. The Planner emits an action to the CAGE\-2 environment, which returns the next observation\.

### 3\.3\. Layer 3: Context Engineering

##### Context Injections

Three context injections feed the Planner’s initial prompt via placeholders: \{observation\}, \{history\}, and \{network\_status\}\. \{observation\} is the raw CybORG dictionary: a verbose, noisy dump of per\-host process tables, network connections, and service states that the LLM must parse unaided\. \{network\_status\} is a compact JSON list of only non\-baseline hosts, each annotated with status \(changed, unknown, analysed\), recency \(Current vs\. Past\), and the action history applied to that host\. It collapses to a single sentence when all hosts are healthy\. \{history\} is a compressed action log that folds consecutive quiet steps \(Monitor, no state change\) into ranges while preserving full detail for intervention steps, controlling prompt growth over the 30\-step episode\. \{history\} thus serves a dual function: it provides temporal context for intervention sequencing and supplies the Planner’s own prior reasoning through programmatically extracted justifications, creating a compressed log of past decisions with explanations\.
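The folding of quiet steps into ranges can be sketched as below. The entry fields (`step`, `action`, `state_changed`, `why`) and the rendered line format are hypothetical; only the rule "consecutive Monitor steps with no state change collapse into a range, intervention steps keep full detail" is taken from the description above.

```python
from itertools import groupby
from typing import Dict, List


def is_quiet(entry: Dict) -> bool:
    """A quiet step is a Monitor action with no state change."""
    return entry["action"] == "Monitor" and not entry["state_changed"]


def compress_history(entries: List[Dict]) -> List[str]:
    """Fold runs of quiet steps into one line; keep other steps verbatim."""
    lines = []
    for quiet, run in groupby(entries, key=is_quiet):
        run = list(run)
        if quiet:
            first, last = run[0]["step"], run[-1]["step"]
            label = f"Step {first}" if first == last else f"Steps {first}-{last}"
            lines.append(f"{label}: Monitor, no change")
        else:
            for e in run:
                # Intervention steps keep the action and its extracted justification.
                lines.append(f"Step {e['step']}: {e['action']} ({e['why']})")
    return lines
```

With this scheme the log grows with the number of interventions rather than with the number of steps, which is the prompt-growth control the text describes.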

##### The Initial User Prompt

The initial user prompt template itself contributes nothing beyond a step counter and a closing question\. All environment understanding comes from the content injected into these placeholders\. Each sub\-agent also receives a role\-specific initial prompt and context components provided by the Planner\. The Analyst receives the target hostname and is asked to assess the host’s situation; the ActionChooser receives a situational JSON from the Planner containing the target host, threat description, severity level, and relevant context from prior steps\. Sub\-agents see neither the full network status nor the episode history\. The Planner decides what to provide, enforcing limited context per role\. All structured injections are fully deterministic and their cost is limited to the marginal tokens they add to the prompt \(examples in Appendix [B](https://arxiv.org/html/2605.16205#A2)\)\.

### 3\.4\. Deliberation Tools

To further increase decision\-making capabilities, each ReAct agent in the hierarchy supports four cumulative levels of deliberation\. The deliberation tools implement a self\-questioning, self\-critique, and self\-refinement cycle inspired by Self\-Refine \(Madaan et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib12)\) and self\-interrogation techniques \(Press et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib25)\); the CoT injection follows Wei et al\. \([2022](https://arxiv.org/html/2605.16205#bib.bib22)\) and Kojima et al\. \([2022](https://arxiv.org/html/2605.16205#bib.bib23)\)\. Unlike cross\-episode reflection \(e\.g\., Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib15)\)\), all deliberation occurs within a single step and carries no memory to future steps\.

The following three *deliberation tools* and an explicit chain\-of\-thought \(CoT\) prompt injection are executed as additional ReAct turns inside one loop *within a single environment step*: \(1\) question: the agent questions its initial reasoning before committing to an action \(Press et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib25)\)\. \(2\) critique \(includes \#1\): the agent generates an explicit critique of its response \(Madaan et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib12)\)\. \(3\) improve \(includes \#1\-2\): the agent revises its action in light of the critique \(Madaan et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib12)\)\. \(4\) COT \(includes \#1\-3\): an explicit chain\-of\-thought instruction \(Wei et al\., [2022](https://arxiv.org/html/2605.16205#bib.bib22); Kojima et al\., [2022](https://arxiv.org/html/2605.16205#bib.bib23)\) is added to the system prompt, providing reasoning scaffolding on top of the tools\.
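Because the four levels are cumulative, each level name determines a full tool set. A small sketch of that mapping follows; the function name and the returned dictionary layout are hypothetical.

```python
TOOL_ORDER = ("question", "critique", "improve")
LEVELS = ("+question", "+critique", "+improve", "+COT")


def deliberation_settings(level: str) -> dict:
    """Map a cumulative level name to its enabled tools and the CoT prompt flag."""
    rank = LEVELS.index(level)  # raises ValueError for an unknown level
    return {
        # Each level enables its own tool plus every lower one; +COT adds no tool.
        "tools": TOOL_ORDER[: min(rank + 1, len(TOOL_ORDER))],
        "cot_prompt": level == "+COT",
    }
```

Under this reading, `+improve` and `+COT` expose the same three tools and differ only in whether the chain-of-thought instruction is added to the system prompt.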

## 4\. Experimental Methodology

This section describes the models, evaluation protocol, metrics, and controlled ablation design that produce the evidence base for our findings\. Table [2](https://arxiv.org/html/2605.16205#S4.T2) provides an at\-a\-glance summary\.

Table 2\. Experimental overview, models evaluated, and evaluation hyperparameters\.

##### Models

We evaluate six models from five contemporary model families \(Table [2](https://arxiv.org/html/2605.16205#S4.T2)\) accessed via OpenRouter \(Grok, Llama, Devstral, Qwen\) and the Google Cloud Gemini API\. We adopt a full\-coverage design: all six models are evaluated on all 12 configurations across three experimental axes \(72 unique model\-configuration pairs\)\. This safeguards against reliance on single\-model results\. All models use deterministic decoding \(temperature 0\)\. All models receive identical prompt templates and tool definitions\. No per\-model tuning is performed\.

##### Evaluation Protocol

Each configuration is evaluated over multiple *containerized agent instances* and multiple *runs* per instance\. Each run is launched with a unique seed and fresh agent state\. The standard allocation is 10 instances $\times$ 5 runs $= 50$ episodes per pair\. G3FP uses a reduced 25\-episode budget\. Several G2\.5FL/Qwen configs use extended batches \(75\-100\) to resolve uncertainty on key comparisons\. These allocation differences do not affect qualitative conclusions, which are validated across all six models \(Appendix [I](https://arxiv.org/html/2605.16205#A9)\)\. Step\-level fresh instantiation ensures token accounting is precise and isolated\.

##### Three\-Axis Design with Shared Anchor Configuration

The experimental design varies three axes\. \(1\) Axis 1 \(Context\): 6 monolithic Planner configurations varying \{obs\}, \{hist\}, and \{net\} placeholders \(Table [1](https://arxiv.org/html/2605.16205#S2.T1)\)\. \(2\) Axis 2 \(Deliberation\): 4 cumulative levels adding \+question, \+critique, \+improve, and \+COT tools to the monolithic Planner\. \(3\) Axis 3 \(Hierarchy\): 2 configs: hier\-base \(delegation to Analyst/ActionChooser, deliberation tools OFF\) and hier\-delib \(delegation \+ deliberation tools ON on all agents, CoT inactive\)\. Axes 2 and 3 share a common default, the hist\+net configuration \(structured state \+ compressed history, no raw obs, no deliberation, no delegation\), which emerged from our engineering ladder as the default compound\-agent context setting\. Axis 1 treats this same configuration as one of six context variants\.

##### Metrics and Statistical Reporting

We report mean episode return \(sum of rewards $r_t \leq 0$, closer to zero is better\) as the primary performance metric, alongside total tokens per episode \(prompt \+ completion\) as the cost primitive\. To jointly evaluate cost and performance, we define returns per token spent \(RPTS\) as the mean return improvement over the observation\-only baseline per kilotoken consumed:

$$\text{RPTS}=\frac{G_{\text{config}}-G_{\text{obs}}}{\text{KTok}_{\text{config}}}$$

where $G_{\text{obs}}$ is the per\-model mean return under the obs configuration and KTok is total tokens \(thousands\) per episode\. RPTS identifies which design choices deliver the largest gains per unit of inference budget; negative values indicate configurations that perform worse than unstructured observation alone\. To capture tail risk in this failure\-mitigation setting, we report standard deviation and catastrophic failure rate \(fraction of episodes with return $<-150$\)\. We adopt *multi\-model replication* as the primary validity guard: findings must replicate across the majority of the tested models\.
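A minimal sketch of both metrics, computed from per-episode lists, is given below. The function names are hypothetical and the sample numbers are made up for illustration; they are not results from the study.

```python
from statistics import mean
from typing import Sequence


def rpts(returns_config: Sequence[float],
         returns_obs: Sequence[float],
         ktok_config: Sequence[float]) -> float:
    """Mean return improvement over the obs baseline per kilotoken consumed."""
    return (mean(returns_config) - mean(returns_obs)) / mean(ktok_config)


def catastrophic_failure_rate(returns: Sequence[float],
                              threshold: float = -150.0) -> float:
    """Fraction of episodes whose return falls below the threshold."""
    return sum(r < threshold for r in returns) / len(returns)


# Made-up per-episode values for one model (illustration only).
obs_returns = [-220.0, -200.0, -140.0, -180.0]
config_returns = [-60.0, -40.0, -50.0, -50.0]
config_ktok = [30.0, 30.0, 30.0, 30.0]
```

With these illustrative inputs the configuration improves mean return by 135 at 30 kilotokens per episode, so its RPTS is 4.5, and three of the four `obs` episodes count as catastrophic.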

##### Reward\-Scale Reference Points

To ground the reward scale, we reference the public CAGE\-2 leaderboard \(TTCP CAGE Challenge Working Group, [2022](https://arxiv.org/html/2605.16205#bib.bib4)\) \(30\-step, B\_line setting\)\. Reference reward values are listed in Table [2](https://arxiv.org/html/2605.16205#S4.T2)\.

## 5\. Results

All configurations follow the protocol and metrics defined in Section [4](https://arxiv.org/html/2605.16205#S4)\. Table [3](https://arxiv.org/html/2605.16205#S5.T3) presents the complete results matrix\. Appendix [E](https://arxiv.org/html/2605.16205#A5) reports additional statistical support, including 95% confidence intervals for mean returns and paired difference confidence intervals for key comparisons\. These analyses preserve the qualitative direction of the main findings while highlighting model\-dependent uncertainty\. We organize the analysis into four subsections corresponding to context \(RQ1\), the interaction between deliberation and hierarchy \(RQ2, RQ3\), cost\-performance frontiers, and robustness\.

Table 3\. Full results matrix\. Mean episode return ± standard deviation \(Ret ± SD\) and kilotokens per episode \(KTok, in units of $10^{3}$ tokens\) across all 72 pairs, grouped by experimental axis\. Underline ↑ = best within axis; box ✓ = best overall per model; ↓ = worst within axis; box ↓ = worst overall per model\. The shared anchor configuration hist\+net† serves as baseline for Axes 2 and 3\.

### 5\.1\. Finding 1: Programmatic State Abstraction Delivers the Largest Gains Per Token

Context engineering ablations reveal that a deterministic layer compiling observations into structured summaries delivers the largest return on token investment\.

![Refer to caption](https://arxiv.org/html/2605.16205v1/fig01_context_heatmap.png) Figure 2\. Context engineering heatmap\. Each cell shows mean episode return for one model\-context pair\. Darker shading indicates worse \(more negative\) returns\. The obs\+net and hist\+net \(Planner\-Only\) columns are consistently strong across models, while obs alone is typically poor\. The heatmap has six model rows and six context configuration columns; cell colour intensity encodes mean return\.

#### 5\.1\.1\. Raw observation alone is ineffective\.

Feeding unprocessed CybORG dictionaries \(obs\) yields the worst or near\-worst performance for five of six models \(leftmost column in Figure [2](https://arxiv.org/html/2605.16205#S5.F2), Table [3](https://arxiv.org/html/2605.16205#S5.T3)\)\. Llama, Qwen, and G2\.5FL fall below $-214$ mean return, approaching the Sleeping \(no\-op\) agent \($-219$\), with catastrophic failure rates \(return $<-150$\) reaching 96\-98% \(Figure [5](https://arxiv.org/html/2605.16205#S5.F5)\)\. The raw format’s verbosity and noise overwhelm the planner\.

#### 5\.1\.2\. Programmatic backbone enables raw data utility\.

Augmenting raw observations with the deterministic \{network\_status\} layer \(obs → obs\+net\) transforms performance without additional LLM calls\. Llama improves by 76% \($-214.7 \to -51.4$\), Qwen by 71%, and Devstral by 53% \(Table [3](https://arxiv.org/html/2605.16205#S5.T3)\)\. The contrast is visible in Figure [2](https://arxiv.org/html/2605.16205#S5.F2), where the obs\+net column is consistently lighter than obs\.
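As a worked check of the Llama figure, assuming the reported percentage is the reduction in penalty magnitude relative to the `obs` baseline (the formula is not spelled out in this section):

```latex
\[
\frac{G_{\text{obs+net}} - G_{\text{obs}}}{\lvert G_{\text{obs}} \rvert}
= \frac{-51.4 - (-214.7)}{214.7}
= \frac{163.3}{214.7}
\approx 0.76
\]
```

Under that assumption the two quoted returns reproduce the stated 76% improvement.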

#### 5\.1\.3\. Less\-but\-structured often beats more\-but\-unstructured\.

The hist\+net anchor configuration \(structured state \+ history, no raw obs\) matches or beats the maximum\-information obs\+hist\+net for four models \(Llama, G3FP, Qwen, Devstral\)\. Adding raw observations to a clean state summary often dilutes the signal; removing them improves Llama’s return by 44% \(compare the hist\+net and obs\+hist\+net columns in Figure [2](https://arxiv.org/html/2605.16205#S5.F2)\)\.

#### 5\.1\.4\. History provides complementary temporal context\.

Adding compressed action history to raw observations \(obs → obs\+hist\) improves all six models, with Llama gaining 36% and Qwen 57% \(Table [3](https://arxiv.org/html/2605.16205#S5.T3)\)\. However, history’s marginal value depends on what other context is present\. When network status is already available, adding history \(obs\+net vs\. obs\+hist\+net\) hurts four of six models \(Table [3](https://arxiv.org/html/2605.16205#S5.T3)\), suggesting that \{network\_status\} already encodes much of the decision\-relevant temporal signal\.

### 5\.2\.Finding 2: Deliberation Destructively Interferes with Hierarchy

The intersection of Axes 2 and 3 reveals the study’s central failure mode: deliberation that helps monolithic agents can degrade hierarchical systems\.

![Paired bar chart comparing hier\-base and hier\-delib across six models with degradation ratios](https://arxiv.org/html/2605.16205v1/fig03_compound_metacognition.png)

Figure 3\. Deliberation cascade effect\. Paired bars show mean episode return for hier\-base \(dark\) vs\. hier\-delib \(light\) across six models\. The degradation ratio is annotated above each pair\. Mean return decreases for all six models\.

#### 5\.2\.1\.Monolithic deliberation is model\-dependent

Adding reasoning tools to the monolithic Planner produces model\-dependent effects \(Table [3](https://arxiv.org/html/2605.16205#S5.T3), Axis 2\)\. Explicit chain\-of\-thought \(\+COT\) is the best deliberation level for four models \(Grok, Devstral, Qwen, G3FP\), often improving on the anchor configuration\. However, patterns are non\-monotonic, and stronger baselines \(e\.g\., Llama on the anchor configuration\) can be destabilized by additional turns\. Token costs scale steeply: \+COT consumes 3\-10$\times$ more tokens than the anchor\.

#### 5\.2\.2\.Hierarchy without deliberation wins

The hier\-base configuration \(delegation only, no deliberation tools\) achieves the best or near\-best absolute performance for four of six models \(G3FP at $-16.1$, Grok at $-24.0$, approaching the top published DRL result of $-3.47$ \(CardiffUni Team, [2022](https://arxiv.org/html/2605.16205#bib.bib5)\); dark bars in Figure [3](https://arxiv.org/html/2605.16205#S5.F3)\)\. The benefit comes from *task decomposition* \(bounded Analyst assessments and ActionChooser rankings\) rather than deeper deliberation\. Llama is the exception, degrading by 22% compared to monolithic hist\+net \($-57.1 \to -69.5$\)\.

![Six scatter subplots with tokens on the x\-axis and mean return on the y\-axis](https://arxiv.org/html/2605.16205v1/fig04_pareto_scatter.png)

Figure 4\. Cost\-performance Pareto frontiers\. Points are shaped by axis \(circles: context, triangles: deliberation, squares: hierarchy\)\. Dashed Pareto frontier lines connect non\-dominated configurations\. Context configurations occupy the low\-cost efficient region; hier\-delib is high\-cost and dominated\.

#### 5\.2\.3\.Deliberation cascade: the destructive interaction

Enabling deliberation tools on all hierarchical agents \(hier\-delib\) decreases mean return relative to hier\-base for all six models \(Table [3](https://arxiv.org/html/2605.16205#S5.T3), Figure [3](https://arxiv.org/html/2605.16205#S5.F3)\)\. Devstral worsens by 3\.37$\times$ \($-37.8 \to -127.4$\), Grok by 1\.68$\times$, and Llama by 1\.55$\times$, while token costs increase by 1\.8\-2\.7$\times$\.

This compositional failure arises from a *deliberation cascade*: independent deliberation loops in the Analyst, ActionChooser, and Planner amplify uncertainty without an arbitration protocol\. Moreover, for four of six models \(Grok, Devstral, Qwen, G3FP\), hier\-base matches or outperforms the best monolithic deliberation at comparable or lower token cost, achieving through task decomposition what deliberation attempts through deeper reasoning\. The degradation under hier\-delib is not attributable to token overhead per se: hier\-base consumes comparable tokens to monolithic \+COT \(e\.g\., Grok: 141\.9K vs\. 144\.5K\) yet achieves substantially better returns \($-24.0$ vs\. $-42.9$\), indicating that the loss is structural\.
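
The degradation ratios quoted above can be checked with a few lines of arithmetic. This is a sketch that assumes the ratio is simply the deliberative mean return divided by the base mean return, which reproduces the Devstral figure; the function name is illustrative.

```python
def degradation_ratio(base_return: float, delib_return: float) -> float:
    """How many times more negative the mean return became.

    Assumes both returns are negative, as in a failure-mitigation setting.
    """
    if base_return >= 0 or delib_return >= 0:
        raise ValueError("expected negative mean returns")
    return delib_return / base_return


# Devstral, hier-base -> hier-delib, using the values quoted in the text.
devstral_ratio = degradation_ratio(-37.8, -127.4)
print(f"{devstral_ratio:.2f}x")  # 3.37x

# Grok: monolithic +COT vs. hier-base token use (in thousands of tokens).
grok_token_ratio = 144.5 / 141.9
print(f"{grok_token_ratio:.2f}x")  # 1.02x
```

The second ratio shows how close the two Grok token budgets are, which is why the gap in return between them is hard to explain by token count alone.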

##### Mechanistic evidence: passivity amplification under distributed deliberation

Auditing a matched episode trace from a Devstral run reveals a recurring pattern of *passivity amplification*: under low\-severity contexts with multiple active decoys, the deliberative ActionChooser converges on “avoid redundant interventions” and defers remediation until evidence becomes unambiguous\. For example, at step 24, User4 has been recently analyzed and instrumented with multiple decoys\. The ActionChooser’s deliberation prioritizes passive observation: Monitor \(0\.9\) $>$ Analyse User4 \(0\.8\) $>$ Restore User4 \(0\.7\)\. The Planner adopts the top recommendation\. On the next step, the Analyst flags severity=high with concrete anomalies \(unusual outbound traffic, unknown processes\), and the Planner overrides and restores User4\. The delay is critical: in a failure\-mitigation regime, even a single\-step deferral prolongs high\-penalty states\. Distributed deliberation overestimates immediate “stability” cues under partial observability, yielding high\-confidence deferral until evidence becomes overwhelming \(see Appendix [G](https://arxiv.org/html/2605.16205#A7)\)\.

Table 4\. RPTS across all configurations \(Section [4](https://arxiv.org/html/2605.16205#S4.SS0.SSS0.Px4)\)\. Higher is better; negative values are worse than obs\. Bold = best per model\.

### 5\.3\.Finding 3: Context Engineering Dominates the Cost\-Performance Frontier

We construct per\-model Pareto frontiers over all twelve configurations \(tokens per episode vs\. mean return, Figure [4](https://arxiv.org/html/2605.16205#S5.F4)\)\. Three patterns emerge\. \(1\) At least one context configuration is Pareto\-efficient for every model, typically obs\+net or hist\+net\. \(2\) Deliberation configurations reach the frontier for only a minority of models and are often dominated by context or hierarchy\. \(3\) hier\-base sets the high\-performance frontier where hierarchy helps, while hier\-delib is consistently dominated by hier\-base, combining higher token cost with worse mean return\.
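
A per-model frontier of this kind can be computed with a straightforward dominance check. The sketch below is illustrative rather than the authors' analysis code: only the two Grok points marked "from the text" are taken from the paper, and the remaining points are hypothetical placeholders chosen to show the mechanics.

```python
def pareto_frontier(points: dict[str, tuple[float, float]]) -> list[str]:
    """Return non-dominated configurations, ordered by token cost.

    Each value is (tokens_in_thousands, mean_return). A configuration is
    dominated if another one uses no more tokens and achieves no worse
    return, and is strictly better on at least one of the two.
    """
    frontier = []
    for name, (tokens, ret) in points.items():
        dominated = any(
            t <= tokens and r >= ret and (t < tokens or r > ret)
            for other, (t, r) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


configs = {
    "obs+net": (40.0, -60.0),      # hypothetical
    "hist+net": (45.0, -55.0),     # hypothetical
    "+COT": (144.5, -42.9),        # Grok, from the text
    "hier-base": (141.9, -24.0),   # Grok, from the text
    "hier-delib": (300.0, -40.0),  # hypothetical
}
frontier = pareto_frontier(configs)
print(frontier)  # ['obs+net', 'hist+net', 'hier-base']
```

With these inputs, +COT and hier-delib are both dominated by hier-base, which uses fewer tokens and reaches a better return.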

![Heatmap with six model rows and twelve configuration columns showing catastrophic failure percentages; the G2\.5FL row is uniformly dark](https://arxiv.org/html/2605.16205v1/fig17_failure_heatmap.png)

Figure 5\. Catastrophic failure rate \(return $< -150$\) by model and configuration\. G2\.5FL fails across all configurations; context engineering reduces catastrophic rates for most other models\.

##### Return per Token Spent

Table [4](https://arxiv.org/html/2605.16205#S5.T4) reports RPTS for all non\-baseline configurations\. Context configurations dominate: obs\+net achieves the highest RPTS for Grok, Llama, Devstral, and Qwen, while hist\+net leads for G3FP\. G2\.5FL is the exception, with \+critique performing best, consistent with its difficulty exploiting structured context\. Hierarchy often improves absolute return but is less token\-efficient: hier\-base consumes substantially more tokens than obs\+net, yielding lower RPTS despite stronger mean return\. The hier\-delib configuration remains high\-cost and low\-efficiency relative to hier\-base\. As a pricing sensitivity check, we re\-weighted token costs using provider\-specific input/output pricing ratios \(Appendix [F](https://arxiv.org/html/2605.16205#A6)\); this did not reverse any qualitative conclusion, although it narrowed the relative cost advantage of context over hierarchy\.
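
The exact RPTS definition is given in Section 4 and is not reproduced in this part of the paper. As a rough illustration of why a configuration can win on mean return yet lose on efficiency, the sketch below assumes one plausible form: the return gained over the obs baseline, divided by tokens spent (in thousands). Both that form and every number in the example are assumptions made for illustration, not values from Table 4.

```python
def rpts(mean_return: float, obs_return: float, tokens_k: float) -> float:
    """Assumed form of return per token spent.

    Gain over the raw-observation baseline per thousand tokens; negative
    when the configuration is worse than the baseline.
    """
    if tokens_k <= 0:
        raise ValueError("token count must be positive")
    return (mean_return - obs_return) / tokens_k


OBS_BASELINE = -200.0  # hypothetical obs mean return

cheap_context = rpts(-60.0, OBS_BASELINE, tokens_k=40.0)   # hypothetical obs+net
costly_hier = rpts(-24.0, OBS_BASELINE, tokens_k=140.0)    # hypothetical hier-base
worse_than_obs = rpts(-210.0, OBS_BASELINE, tokens_k=50.0)

print(cheap_context > costly_hier)  # True
print(worse_than_obs < 0)           # True
```

Under this assumed form, the hierarchical configuration has the better return but the lower efficiency, matching the qualitative pattern described above.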

### 5\.4\.Robustness, Variance, and Tail Risk

Qualitative effects \(context helps, distributed deliberation hurts\) are consistent across models, but magnitudes vary by 2\-10$\times$\. Standard deviations \(SD\) of episode return, reported alongside means in Table [3](https://arxiv.org/html/2605.16205#S5.T3), reveal how each design axis affects outcome variability\. Context engineering improves mean return and compresses variance\. Programmatic state abstraction reduces SD while improving mean return \(e\.g\., Grok: 70 to 40, Devstral: 65 to 48\) and reduces catastrophic failure \(return $< -150$\) from $>90\%$ under obs to $<10\%$ in the strongest cases \(Figure [5](https://arxiv.org/html/2605.16205#S5.F5)\)\. Bounded hierarchy further tightens outcomes: G3FP under hier\-base achieves SD $= 3$, and Grok’s SD falls from 79 to 27\. Conversely, hier\-delib increases variance: Devstral’s SD rises from 37 to 72 and its catastrophic rate jumps from 2% to 44%, consistent with the deliberation cascade\. G2\.5FL fails catastrophically \($>48\%$ of episodes\) across all configurations, suggesting a capability floor for the structured I/O compliance the architecture requires; our positive claims do not depend on G2\.5FL\. Hierarchy provides limited or negative value for two models\. Llama worsens under hierarchy \($-57.1 \to -69.5$, 22%\) with SD remaining high at 60\. G2\.5FL improves only marginally \(12%; $-208.7 \to -183.1$\)\. The remaining four models improve substantially \(52\-79%\)\. Multi\-model evaluation is thus essential: a Llama\-only study would conclude hierarchy hurts, whereas a Grok\-only study would assert it is important\.
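
The three statistics used in this subsection (mean, SD, and catastrophic failure rate) can all be derived from a list of per-episode returns. The following sketch uses hypothetical episode returns; the $-150$ threshold is the one stated in the text, while the choice of sample standard deviation is an assumption, since the estimator is not specified in this section.

```python
from statistics import mean, stdev

CATASTROPHIC_THRESHOLD = -150.0  # threshold stated in the text


def summarize(returns: list[float]) -> dict[str, float]:
    """Mean, sample SD, and share of episodes below the threshold."""
    return {
        "mean": mean(returns),
        "sd": stdev(returns),  # sample SD (assumed estimator)
        "catastrophic_rate": sum(r < CATASTROPHIC_THRESHOLD for r in returns)
        / len(returns),
    }


# Hypothetical per-episode returns for one model-configuration pair.
episodes = [-20.0, -35.0, -160.0, -25.0, -180.0]
stats = summarize(episodes)
print(stats["mean"], stats["catastrophic_rate"])  # -84.0 0.4
```

The example shows how a moderate mean can hide tail risk: two of five episodes are catastrophic even though the mean sits well above the threshold.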

## 6\.Discussion

Our findings share a unifying theme: in adversarial sequential POMDPs, the value of a design choice is determined by *information flow* through the system rather than per\-component merit\. We distill three design principles \(RQ1\-RQ3\)\.

##### Principle 1: Invest in deterministic infrastructure before LLM reasoning \(RQ1\)\.

The programmatic state\-tracking layer delivers the largest consistent gains per token by shifting the LLM from perception\-plus\-reasoning to reasoning\-over\-state\. Knowledge\-free agents rely solely on this scaffolding, so gains come from *presentation* and *uncertainty compression*, not domain expertise\. Context engineering also compresses tail risk: catastrophic failure \(return $< -150$, roughly the Random Agent level\) drops from $>90\%$ under obs to $<10\%$ under obs\+net or hist\+net in the strongest cases \(Figure [5](https://arxiv.org/html/2605.16205#S5.F5)\), making it the most reliable lever for both the average and the worst case\.

##### Principle 2: Decompose into bounded specialists, not reflective generalists \(RQ2, RQ3\)\.

Hierarchy without deliberation \(hier\-base\) achieves best or near\-best absolute performance for four of six models\. The benefit comes from *interface constraints*: the Analyst gives a bounded assessment and the ActionChooser a ranked list, turning an open\-ended generation problem into a verifiable decision\.
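
One way to picture an interface constraint is as a schema that a sub-agent's output must pass before the Planner sees it. The sketch below is a hypothetical validator for a ranked action list; the type names, the score range, and the list bound are assumptions, and the action set is only the subset that appears in the trace example of Section 5.2 (the actual agent definitions are in Appendix B).

```python
from dataclasses import dataclass
from typing import Optional

# Subset of actions seen in the trace example; the real action set is larger.
ALLOWED_ACTIONS = {"Monitor", "Analyse", "Restore"}


@dataclass(frozen=True)
class RankedAction:
    action: str
    target: Optional[str]  # host name, or None for host-independent actions
    score: float


def validate_ranking(items: list[RankedAction], max_items: int = 3) -> list[RankedAction]:
    """Reject malformed rankings and return the rest sorted best-first."""
    if not 1 <= len(items) <= max_items:
        raise ValueError(f"expected 1 to {max_items} ranked actions")
    for item in items:
        if item.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {item.action}")
        if not 0.0 <= item.score <= 1.0:
            raise ValueError(f"score out of range: {item.score}")
    return sorted(items, key=lambda item: item.score, reverse=True)


ranking = validate_ranking([
    RankedAction("Restore", "User4", 0.7),
    RankedAction("Monitor", None, 0.9),
    RankedAction("Analyse", "User4", 0.8),
])
print([r.action for r in ranking])  # ['Monitor', 'Analyse', 'Restore']
```

Because the output space is small and checkable, a malformed or out-of-vocabulary response can be rejected programmatically instead of being passed along as free text.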

##### Principle 3: Do not distribute deliberation without an uncertainty\-resolution protocol \(RQ2, RQ3\)\.

Enabling deliberation across all hierarchical agents degrades performance and inflates cost\. Independent critique loops create *cascading uncertainty*\. When a sub\-agent critiques its own answer, it introduces excessive qualifications\. The consuming agent cannot distinguish these from genuine warnings about the environment, so caution accumulates through the hierarchy\. If deliberation is needed across a hierarchy, centralize it or use explicit mediation \(e\.g\., confidence gating, aggregation rules\)\.
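
The paper names confidence gating as a possible mediation mechanism but does not specify or evaluate one. Purely as an illustration of the idea, the sketch below shows a hypothetical gate that refuses to accept a passive top recommendation when it is only weakly preferred or when severity is high. The rule, the margin threshold, and the passive-action set are all assumptions.

```python
PASSIVE_ACTIONS = {"Monitor"}  # assumed set of non-intervening actions


def gate(ranked: list[tuple[str, float]], severity: str, min_margin: float = 0.15) -> str:
    """Pick an action from (action, confidence) pairs.

    A passive top choice is overridden by the best active choice when the
    severity is high or the confidence margin is below `min_margin`.
    """
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    top_action, top_score = ordered[0]
    active = [(a, s) for a, s in ordered if a not in PASSIVE_ACTIONS]
    if top_action in PASSIVE_ACTIONS and active:
        best_active, active_score = active[0]
        if severity == "high" or top_score - active_score < min_margin:
            return best_active
    return top_action


# Confidences from the step-24 trace example in Section 5.2.
trace = [("Monitor", 0.9), ("Analyse User4", 0.8), ("Restore User4", 0.7)]
print(gate(trace, severity="low"))  # Analyse User4

# A clearly preferred passive action is kept when severity is low.
print(gate([("Monitor", 0.9), ("Restore User4", 0.5)], severity="low"))  # Monitor
```

In the trace example the margin between Monitor and the best active action is only 0.1, so this particular gate would have acted one step earlier; whether such a rule helps overall is an open question that the paper leaves to future work.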

##### Model dependence and practical ordering\.

Effect directions \(context helps, distributed deliberation hurts\) hold across models, but magnitudes vary by 2\-10$\times$ and some effects flip \(e\.g\., Llama is harmed by hierarchy\)\. Single\-model studies would thus contradict each other\. The best first step differs by model \(context for weak raw\-observation handling, hierarchy for strong baselines\)\. The observed Pareto ordering follows our engineering trajectory: \(1\) context is the most reliable lever; \(2\) bounded hierarchy sets the ceiling; \(3\) monolithic deliberation is capability\-conditional; \(4\) distributed deliberation is dominated\. Our results complement topology\-focused scaling \(Kim et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib10)\) by showing that internal configuration determines whether decomposition helps or hurts, revealing failure modes invisible to topology\-only analyses\. Each layer builds on the previous, and the value of hierarchy and deliberation is conditional on the quality of infrastructure and context\.

##### Transferability\.

The effect directions \(Principles 1–3\) replicate across six models from five families, making them directional starting points rather than hard prescriptions; specific magnitudes are tied to CAGE\-2 and the models tested\. A practitioner should: \(1\) add structured programmatic context \(environment model, history tracking, observation decomposition, baseline state\); \(2\) compare monolithic configurations against raw observations; \(3\) test bounded hierarchy if feasible; and \(4\) avoid distributing deliberation tools across sub\-agents by default\. The state\-tracking layer is likely to transfer where observations are structured enough that a program can track what has changed and why, and most steps are routine so history compresses without losing important information\.

## 7\.Related Work

Our study sits at the intersection of four research streams that prior work typically addresses in isolation\.

##### Multi\-Agent Architecture and Scaling

Recent work categorizes coordination protocols \(Qian et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib1); Tran et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib2)\), derives topology scaling laws \(Kim et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib10)\), and proposes structured communication formats \(Khaled and Monticolo, [2026](https://arxiv.org/html/2605.16205#bib.bib20); Tang et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib19)\)\. These studies vary wiring while treating nodes as fixed; we hold wiring fixed and ablate *internal* configuration, revealing failure modes like deliberation cascades that are invisible to topology\-only analyses\.

##### Autonomous Cyber Defense

CybORG CAGE\-2 \(TTCP CAGE Challenge Working Group, [2022](https://arxiv.org/html/2605.16205#bib.bib4); Kiely et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib9)\) has been addressed with RL \(Bates et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib3)\), model\-based planning \(Hammar et al\., [2024](https://arxiv.org/html/2605.16205#bib.bib7)\), and particle filtering \(Le and Stadler, [2025](https://arxiv.org/html/2605.16205#bib.bib21)\)\. LLM\-based defenders \(Mohammadi et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib13); Castro et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib6)\) are newer but lack controlled architectural ablations\. We provide the first cost\-performance analysis of compound LLM design decisions in CAGE\-2\.

##### Context Engineering

Context design is a critical lever \(Karpathy, [2025](https://arxiv.org/html/2605.16205#bib.bib8)\), supported by tooling ecosystems \(LangChain, [2025](https://arxiv.org/html/2605.16205#bib.bib11)\) and algorithmic context evolution \(Zhang et al\., [2025](https://arxiv.org/html/2605.16205#bib.bib18)\)\. We contribute an orthogonal, controlled ablation of context *composition* \(raw vs\. structured\) in a POMDP, showing that deterministic state abstraction outperforms raw observations at near\-zero marginal cost\.

##### Deliberation and Self\-Critique

Intra\-step deliberation techniques, such as chain\-of\-thought \(Wei et al\., [2022](https://arxiv.org/html/2605.16205#bib.bib22); Kojima et al\., [2022](https://arxiv.org/html/2605.16205#bib.bib23)\), self\-interrogation \(Press et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib25)\), and self\-refinement \(Madaan et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib12)\), operate within a single inference call, unlike cross\-episode methods such as Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2605.16205#bib.bib15)\)\. Recent work shows that self\-correction without external feedback can be harmful in monolithic agents \(Huang et al\., [2024](https://arxiv.org/html/2605.16205#bib.bib24)\), and Renze and Guven \([2024](https://arxiv.org/html/2605.16205#bib.bib14)\) decompose reflection components but evaluate only monolithic settings\. To our knowledge, no prior work studies deliberation *distributed* across a hierarchy\. Our deliberation cascade finding extends the single\-agent self\-correction limitation to compound systems, identifying a compositional failure mode invisible to either literature in isolation\.

## 8\.Limitations and Future Work

Our claims are scoped to *structured adversarial POMDPs* where deterministic state abstraction is feasible\. We use a single environment \(fixed topology, scripted adversary, 30\-step horizon\) and a single three\-agent hierarchy; alternative topologies may exhibit different deliberation cascade dynamics\. Token counts proxy cost but do not capture latency or pricing, and knowledge\-free prompts isolate architectural effects without fully disentangling pretrained priors\. Deliberation tools are tested only in cumulative activation order; independent activation may yield different interaction patterns\. Our model selection spans mid\-tier and efficiency\-focused families; frontier\-scale models may respond differently\. Our design space is static\. Priorities for future work include independent ablation of individual deliberation tools to isolate which components drive the cascade, selective deliberation placement within the hierarchy \(e\.g\., enabling deliberation on only one sub\-agent\), and evaluation on frontier\-scale models to test whether the observed effects persist at higher capability levels\. Extending ablations to diverse environments and designing inter\-agent uncertainty arbitration protocols, such as confidence gating or calibrated aggregation, are also important next steps\.

## 9\.Conclusion

We presented a controlled cost\-performance study of compound LLM agent design in an adversarial, partially observable sequential environment \(CybORG CAGE\-2\)\. Across a three\-axis ablation of context representation \(6 configurations\), deliberation depth \(4 cumulative levels\), and hierarchical decomposition \(2 configurations\), we evaluated 72 model\-configuration pairs spanning five model families, totaling 3,475 episodes and 283\.9M tokens\.

Three conclusions emerge\. First \(RQ1\), *context engineering dominates*: deterministic programmatic state abstraction yields the largest and most consistent gains per token, while raw observations alone are destabilizing\. Second \(RQ2\), *hierarchy can substitute for deliberation*: bounded specialist decomposition \(hier\-base\) achieves the best absolute performance for most models through strict I/O contracts rather than deeper per\-agent reasoning\. Third \(RQ3\), *deliberation is not modular*: distributing deliberation tools across a hierarchy \(hier\-delib\) produces a *deliberation cascade* that degrades returns while increasing token expenditure\.

A cross\-cutting finding reinforces these conclusions: qualitative effects hold across all six models, but magnitudes vary by 2\-10$\times$ and some reverse sign \(e\.g\., Llama is harmed by hierarchy\), validating multi\-model evaluation as essential for compound AI research\. These results suggest a practical starting point for structured adversarial POMDPs, mirroring the trajectory we followed: build deterministic infrastructure to deliver clean structured context, add bounded hierarchy when models can exploit delegation, and treat deliberation as a costly capability\-conditional option rather than a universal upgrade\. This is not a universal prescription, as magnitudes are environment\-dependent, and practitioners should validate this ordering in their own settings\. More broadly, our findings suggest that the science of compound AI systems requires studying topology, node internals, and the interaction effects that arise when individually sensible components are composed\. The deliberation cascade identified in this study is one such interaction effect, and designing inter\-agent uncertainty arbitration protocols to prevent it is a promising direction for future work\. Reproducibility details and ethics considerations are in Appendix [A](https://arxiv.org/html/2605.16205#A1)\. The archived artifact is available at [https://doi\.org/10\.5281/zenodo\.19908100](https://doi.org/10.5281/zenodo.19908100); the development repository is available at [https://github\.com/isbogdanov/agent\-design\-study](https://github.com/isbogdanov/agent-design-study)\.

## References

- E\. Bates, V\. Mavroudis, and C\. Hicks \(2023\) Reward shaping for happier autonomous cyber security agents\. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security \(AISec ’23\), New York, NY, USA, pp\. 221–232\. External Links: [Document](https://dx.doi.org/10.1145/3605764.3623916), [Link](https://doi.org/10.1145/3605764.3623916)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- CardiffUni Team \(2022\) CybORG CAGE\-2 Winning Agent: PPO \+ Greedy Decoys\. Note: [https://github\.com/john\-cardiff/\-cyborg\-cage\-2](https://github.com/john-cardiff/-cyborg-cage-2)\. Accessed: 2026\-04\-28\. Cited by: [§5\.2\.2](https://arxiv.org/html/2605.16205#S5.SS2.SSS2.p1.4)\.
- S\. R\. Castro, R\. Campbell, N\. Lau, O\. Villalobos, J\. Duan, and A\. A\. Cardenas \(2025\) Large language models are autonomous cyber defenders\. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence \(CAI\), pp\. 1125–1132\. External Links: [Document](https://dx.doi.org/10.1109/CAI64502.2025.00195), [Link](https://doi.org/10.1109/CAI64502.2025.00195)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- K\. Hammar, N\. Dhir, and R\. Stadler \(2024\) Optimal defender strategies for CAGE\-2 using causal modeling and tree search\. arXiv\. External Links: 2407\.11070, [Document](https://dx.doi.org/10.48550/arXiv.2407.11070), [Link](https://arxiv.org/abs/2407.11070)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. W\. Yu, X\. Song, and D\. Zhou \(2024\) Large language models cannot self\-correct reasoning yet\. In International Conference on Learning Representations \(ICLR\)\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2310.01798), [Link](https://arxiv.org/abs/2310.01798)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- A\. Karpathy \(2025\) \+1 for “context engineering” over “prompt engineering”\. Note: X \(formerly Twitter\) post\. Accessed 2026\-02\-22\. External Links: [Link](https://x.com/karpathy/status/1937902205765607626)\. Cited by: [§1](https://arxiv.org/html/2605.16205#S1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.16205#S1.p1.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px3.p1.1)\.
- K\. B\. Khaled and D\. Monticolo \(2026\) G2CP: a graph\-grounded communication protocol for verifiable and efficient multi\-agent reasoning\. arXiv\. External Links: 2602\.13370, [Document](https://dx.doi.org/10.48550/arXiv.2602.13370), [Link](https://arxiv.org/abs/2602.13370)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px1.p1.1)\.
- M\. Kiely, D\. Bowman, M\. Standen, and C\. Moir \(2023\) On autonomous agents in a cyber defence environment\. arXiv\. External Links: 2309\.07388, [Document](https://dx.doi.org/10.48550/arXiv.2309.07388), [Link](https://arxiv.org/abs/2309.07388)\. Cited by: [§2](https://arxiv.org/html/2605.16205#S2.SS0.SSS0.Px1.p1.3), [§3\.2](https://arxiv.org/html/2605.16205#S3.SS2.SSS0.Px2.p1.4), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- Y\. Kim, K\. Gu, C\. Park, C\. Park, S\. Schmidgall, A\. A\. Heydari, Y\. Yan, Z\. Zhang, Y\. Zhuang, M\. Malhotra, P\. P\. Liang, H\. W\. Park, Y\. Yang, X\. Xu, Y\. Du, S\. Patel, T\. Althoff, D\. McDuff, and X\. Liu \(2025\) Towards a science of scaling agent systems\. arXiv\. External Links: 2512\.08296, [Document](https://dx.doi.org/10.48550/arXiv.2512.08296), [Link](https://arxiv.org/abs/2512.08296)\. Cited by: [§1](https://arxiv.org/html/2605.16205#S1.SS0.SSS0.Px1.p1.1), [§3\.1](https://arxiv.org/html/2605.16205#S3.SS1.p1.1), [§6](https://arxiv.org/html/2605.16205#S6.SS0.SSS0.Px4.p1.2), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px1.p1.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\) Large language models are zero\-shot reasoners\. In Advances in Neural Information Processing Systems, Vol\. 35\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2205.11916), [Link](https://arxiv.org/abs/2205.11916)\. Cited by: [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p1.1), [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p2.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- LangChain \(2025\) LangChain\. Note: Open\-source software framework\. Accessed 2026\-02\-22\. External Links: [Link](https://github.com/langchain-ai/langchain)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px3.p1.1)\.
- D\. H\. Le and R\. Stadler \(2025\) Learning optimal defender strategies for CAGE\-2 using a POMDP model\. arXiv\. External Links: 2509\.06539, [Document](https://dx.doi.org/10.48550/arXiv.2509.06539), [Link](https://arxiv.org/abs/2509.06539)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\) Self\-refine: iterative refinement with self\-feedback\. In Advances in Neural Information Processing Systems, Vol\. 36\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2303.17651), [Link](https://arxiv.org/abs/2303.17651)\. Cited by: [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p1.1), [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p2.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- H\. Mohammadi, J\. J\. Davis, and M\. Kiely \(2025\) Leveraging large language models for autonomous cyber defense: insights from CAGE\-2 simulations\. IEEE Intelligent Systems 40, pp\. 29–36\. External Links: [Document](https://dx.doi.org/10.1109/MIS.2025.3568209), [Link](https://doi.org/10.1109/MIS.2025.3568209)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\) Measuring and narrowing the compositionality gap in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp\. 5687–5711\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.378), [Link](https://arxiv.org/abs/2210.03350)\. Cited by: [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p1.1), [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p2.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- C\. Qian, Z\. Xie, Y\. Wang, W\. Liu, K\. Zhu, H\. Xia, Y\. Dang, Z\. Du, W\. Chen, C\. Yang, Z\. Liu, and M\. Sun \(2025\) Scaling large language model\-based multi\-agent collaboration\. In International Conference on Learning Representations \(ICLR\)\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2406.07155), [Link](https://arxiv.org/abs/2406.07155)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px1.p1.1)\.
- M\. Renze and E\. Guven \(2024\) Self\-reflection in LLM agents: effects on problem\-solving performance\. arXiv\. External Links: 2405\.06682, [Document](https://dx.doi.org/10.48550/arXiv.2405.06682), [Link](https://arxiv.org/abs/2405.06682)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. In Advances in Neural Information Processing Systems, Vol\. 36\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2303.11366), [Link](https://arxiv.org/abs/2303.11366)\. Cited by: [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p1.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- M\. Standen, M\. Lucas, D\. Bowman, T\. J\. Richer, J\. Kim, and D\. Marriott \(2021\) CybORG: a gym for the development of autonomous cyber agents\. arXiv\. External Links: 2108\.09118, [Document](https://dx.doi.org/10.48550/arXiv.2108.09118), [Link](https://arxiv.org/abs/2108.09118)\. Cited by: [Appendix H](https://arxiv.org/html/2605.16205#A8.p1.1), [§2](https://arxiv.org/html/2605.16205#S2.SS0.SSS0.Px1.p1.3)\.
- Y\. Tang, W\. Su, Y\. Zhou, Y\. Liu, M\. Zhang, S\. Ma, and Q\. Ai \(2025\) Augmenting multi\-agent communication with state delta trajectory\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 10219–10240\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.518), [Link](https://aclanthology.org/2025.emnlp-main.518/)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px1.p1.1)\.
- K\. Tran, D\. Dao, M\. Nguyen, Q\. Pham, B\. O’Sullivan, and H\. D\. Nguyen \(2025\) Multi\-agent collaboration mechanisms: a survey of LLMs\. arXiv\. External Links: 2501\.06322, [Document](https://dx.doi.org/10.48550/arXiv.2501.06322), [Link](https://arxiv.org/abs/2501.06322)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px1.p1.1)\.
- TTCP CAGE Challenge Working Group \(2022\) TTCP CAGE Challenge 2\. Note: Accessed 2026\-02\-22\. External Links: [Link](https://github.com/cage-challenge/cage-challenge-2)\. Cited by: [Appendix A](https://arxiv.org/html/2605.16205#A1.SS0.SSS0.Px1.p1.1), [Appendix H](https://arxiv.org/html/2605.16205#A8.p1.1), [§2](https://arxiv.org/html/2605.16205#S2.SS0.SSS0.Px1.p1.3), [§4](https://arxiv.org/html/2605.16205#S4.SS0.SSS0.Px5.p1.1), [Table 2](https://arxiv.org/html/2605.16205#S4.T2.5.24.18.2.1.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Advances in Neural Information Processing Systems, Vol\. 35\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2201.11903), [Link](https://arxiv.org/abs/2201.11903)\. Cited by: [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p1.1), [§3\.4](https://arxiv.org/html/2605.16205#S3.SS4.p2.1), [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px4.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. In International Conference on Learning Representations \(ICLR\)\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2210.03629), [Link](https://arxiv.org/abs/2210.03629)\. Cited by: [§2](https://arxiv.org/html/2605.16205#S2.SS0.SSS0.Px2.p1.1), [§3](https://arxiv.org/html/2605.16205#S3.p1.1)\.
- Q\. Zhang, C\. Hu, S\. Upasani, B\. Ma, F\. Hong, V\. Kamanuru, J\. Rainton, C\. Wu, M\. Ji, H\. Li, U\. Thakker, J\. Zou, and K\. Olukotun \(2025\) Agentic context engineering: evolving contexts for self\-improving language models\. arXiv\. External Links: 2510\.04618, [Document](https://dx.doi.org/10.48550/arXiv.2510.04618), [Link](https://arxiv.org/abs/2510.04618)\. Cited by: [§7](https://arxiv.org/html/2605.16205#S7.SS0.SSS0.Px3.p1.1)\.

Appendix organization\. Appendix [A](https://arxiv.org/html/2605.16205#A1) provides reproducibility details and ethics considerations\. Appendix [B](https://arxiv.org/html/2605.16205#A2) lists the complete YAML definitions for the Planner, Analyst, and ActionChooser\. Appendix [C](https://arxiv.org/html/2605.16205#A3) documents the deliberation tool schemas and activation flags\. Appendix [D](https://arxiv.org/html/2605.16205#A4) reports complete results, cross\-axis comparisons, distributional analysis, and token cost data for all 72 model–configuration pairs\. Appendix [E](https://arxiv.org/html/2605.16205#A5) provides 95% confidence intervals and paired mean\-return difference confidence intervals\. Appendix [F](https://arxiv.org/html/2605.16205#A6) breaks down token consumption and prompt/completion shifts\. Appendix [G](https://arxiv.org/html/2605.16205#A7) provides illustrative trajectory excerpts for the deliberation cascade failure mode\. Appendix [H](https://arxiv.org/html/2605.16205#A8) summarizes CAGE\-2 environment details, network topology, and instance difficulty\. Appendix [I](https://arxiv.org/html/2605.16205#A9) lists the evaluated episode counts per model and configuration\.

## Appendix AReproducibility & Ethics

##### Ethics\.

All authors have read and adhere to the ACM Code of Ethics \(https://www\.acm\.org/code\-of\-ethics\)\. All experiments run within the simulated CybORG CAGE\-2 environment \(TTCP CAGE Challenge Working Group, [2022](https://arxiv.org/html/2605.16205#bib.bib4)\); no real networks, live attack infrastructure, or human\-subject data are involved\. The work is strictly defensive in scope\. LLM tools were used for language polishing and data processing scripts; all design decisions, analyses, and claims are authored by the research team\.

##### Reproducibility\.

All models use deterministic decoding \(temperature 0 or provider minimum\)\. We release the source code, exact YAML configuration snapshots, episode\-allocation metadata, container specification, API\-key template, and experiment runner needed to rerun the evaluated variants\. The paper reports the aggregate results, token accounting, and episode counts used for the main claims, while the artifact provides the implementation and configuration snapshots needed for inspection and selected reruns\. We rely on cross\-model replication of qualitative effects rather than single\-model statistical significance\.

### A\.1\. Artifact Availability and Scope

The artifact contains the agent implementation, experiment runner, container specification, API\-key template, and configuration snapshots needed to rerun evaluated variants\. The main implementation is in agent\_base/, including the CybORG and LLM\-agent coordinators, provider\-connector configuration, logging utilities, and the YAML\-defined Planner, Analyst, and ActionChooser agents\. exp\_configs/ contains the twelve paper configurations \(six context, four deliberation, two hierarchy\), each a self\-contained YAML snapshot; switching conditions requires changing definitions\_source in experiment\_agent\_eval\.yaml\.

Experiments are launched via run\_experiment\.py \(10 instances $\times$ 5 runs = 50 episodes per pair by default\)\. The Dockerfile builds a Python environment with CybORG CAGE\-2 and dependencies; LLM access is supplied through the provided \.env\.template\.

A run creates an experiments/ directory with the copied configuration, per\-instance reports, aggregate summaries, and token\-usage logs\. Full raw LLM\-provider transcripts are not bundled due to storage size; original run seeds are also not included, so the artifact supports executable reruns rather than bit\-for\-bit log regeneration\. Original logs may be available upon request\. The primary reproducibility target is to enable inspection of the implementation and configuration snapshots, and rerunning of selected configurations\.

## Appendix B Agent Definitions

This appendix provides the complete YAML definitions for all three agents \(Planner, Analyst, ActionChooser\), demonstrating the near\-zero\-knowledge starting point described in Section [2](https://arxiv.org/html/2605.16205#S2.SS0.SSS0.Px3)\. Each agent is defined by three files: core\.yaml \(identity, model binding, tool flags\), initial\_prompt\.yaml \(per\-step prompt template\), and persistent\_knowledge\.yaml \(domain knowledge\)\. A shared common\_knowledge\.yaml applies to all agents\. All remaining knowledge files, reflection\_knowledge\.yaml, reflection\_examples\.yaml, and examples\.yaml, are confirmed empty for every agent in every experiment reported in this paper; the learning subsystem that would populate them is disabled throughout\. The boolean flags include\_tool\_raise\_a\_question, include\_tool\_critique\_the\_answer, and include\_tool\_improve\_based\_on\_critique in each agent’s core\.yaml are set to false in the baseline configuration and toggled to true cumulatively for the deliberation axis \(Section [3\.4](https://arxiv.org/html/2605.16205#S3.SS4)\)\. The YAML shown below reflects the baseline \(anchor\) configuration; deliberation variants differ only in these three flags and in the optional include\_COT\_instruction injection\.

### B\.1\. Shared Configuration

##### common\_knowledge\.yaml\.

```
reflection_knowledge:
```

This file is empty, confirming that no shared domain heuristics are injected across agents\.

### B\.2\. Planner

##### planner/core\.yaml\.

```
agent_type: "ReAct"
include_tool_raise_a_question: false
include_tool_critique_the_answer: false
include_tool_improve_based_on_critique: false
include_COT_instruction: false

add_examples: false
add_reflection_examples: true

system_message: |
  You are the strategic decision-maker for the
  Blue team's cybersecurity defense. Your role is
  to select ONE action per step that best defends
  the network.

  You must balance immediate threats with strategic
  positioning, considering that you can only act
  once per step.

  You MUST follow your <reflection_rules> during
  each thought phase.

  Provide a final answer to the question asked
  by the USER.

COT_instruction: |
  You MUST follow your Chain-of-Thought instructions
  during each thought phase.
  1. REVIEW SITUATION: Check network status and step
     history. Which hosts need attention? What actions
     have been tried?
  2. IDENTIFY TARGET: Select the most critical host or
     threat to address this step
  3. GATHER INFO: If needed, use
     get_analysis_of_host_update for detailed analysis
     of a changed host
  4. GET SUGGESTIONS: Call
     get_suggestion_for_next_action with JSON:
     {"target_host": "hostname", "situation":
     "description", "severity": "level", "context":
     "relevant history"}
  5. DECIDE: Choose ONE action from suggestions. You
     may override based on strategic reasoning

rules:
  - You must select ONLY ONE action for your final
    Answer from the list of suggestions provided by
    the 'get_suggestion_for_next_action' tool
  - Your final Answer MUST be a verbatim copy of the
    action-string from ONE of the suggestions
  - TOOLS CANNOT HANDLE MULTIPLE HOSTS, YOU MUST
    SELECT ONLY ONE SPECIFIC HOST AT A TIME

tools:
  - name: "get_analysis_of_host_update"
    description: "Provides a detailed analysis and
      comparison of the state change of the specific
      host to its baseline"
    example_calling:
      "get_analysis_of_host_update: User4"

  - name: "get_suggestion_for_next_action"
    description: |
      Suggests a ranked list of final actions, each
      with a justification and confidence score based
      on recent analysis and action history.

      CRITICAL INPUT REQUIREMENT: Your input MUST be
      a valid JSON object with these required keys:
      - "target_host": The specific single hostname
      - "situation": Brief description of the threat
      - "severity": Threat level
        (low|medium|high|critical)
      - "context": Relevant information from
        previous steps
    example_calling: >
      get_suggestion_for_next_action:
      {"target_host": "Enterprise1", "situation":
      "critically compromised with active C2",
      "severity": "critical", "context": "Remove
      action failed previously"}
    is_critical: true
```

The Planner’s two domain\-specific tools, get\_analysis\_of\_host\_update and get\_suggestion\_for\_next\_action, are the interfaces through which the Coordinator spawns the Analyst and ActionChooser sub\-agents in hierarchical configurations \(hier\-base and hier\-delib\)\. When delegation is disabled, the Planner does not invoke these tools and instead emits an environment action directly\.
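As an illustration of the tool interface above, the following is a minimal sketch of how a coordinator might split a Planner tool call and check the JSON payload that get\_suggestion\_for\_next\_action requires. The function names `parse_tool_call` and `validate_suggestion_input` are hypothetical and are not taken from the artifact; only the required keys and severity levels come from the tool description.

```python
import json

# Required keys and severity levels as listed in the tool description.
REQUIRED_KEYS = {"target_host", "situation", "severity", "context"}
SEVERITIES = {"low", "medium", "high", "critical"}


def parse_tool_call(line: str) -> tuple[str, str]:
    """Split 'tool_name: argument' on the first colon only."""
    name, _, arg = line.partition(":")
    return name.strip(), arg.strip()


def validate_suggestion_input(arg: str) -> dict:
    """Check the JSON object passed to get_suggestion_for_next_action."""
    payload = json.loads(arg)  # raises ValueError on malformed JSON
    if not isinstance(payload, dict):
        raise ValueError("input must be a JSON object")
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if payload["severity"] not in SEVERITIES:
        raise ValueError(f"invalid severity: {payload['severity']!r}")
    return payload


call = (
    'get_suggestion_for_next_action: {"target_host": "Enterprise1", '
    '"situation": "critically compromised with active C2", '
    '"severity": "critical", "context": "Remove action failed previously"}'
)
name, arg = parse_tool_call(call)
payload = validate_suggestion_input(arg)
print(name, payload["target_host"])
```

Splitting on the first colon only keeps the colons inside the JSON argument intact, which matters because the payload itself is colon-delimited.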

##### planner/initial\_prompt\.yaml\.

```
prompt:
  opening: |
    You are at step {step_number} now.

    {network_status}

    {history}

  closing: |
    What action should be taken next?
```

The placeholders \{network\_status\}, \{history\}, and \(in context variants that include it\) \{observation\} are populated deterministically at each step by the environment\-state layer described in Section [3\.2](https://arxiv.org/html/2605.16205#S3.SS2)\. The anchor configuration includes \{network\_status\} and \{history\} but omits \{observation\}\.
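The template filling described above can be sketched as follows. This is an illustrative reconstruction, not the artifact's code: the helper name `build_planner_prompt` is hypothetical, and the position at which the optional \{observation\} text is added is an assumption, since the source does not show the templates of the observation\-bearing variants.

```python
# Opening and closing text copied from planner/initial_prompt.yaml.
OPENING = (
    "You are at step {step_number} now.\n\n"
    "{network_status}\n\n"
    "{history}\n"
)
CLOSING = "What action should be taken next?"


def build_planner_prompt(step_number, network_status, history, observation=None):
    """Fill the Planner template; the raw observation is optional."""
    text = OPENING.format(
        step_number=step_number,
        network_status=network_status,
        history=history,
    )
    if observation is not None:
        # Placement of the observation is an assumption of this sketch.
        text += f"\n{observation}\n"
    return text + "\n" + CLOSING


prompt = build_planner_prompt(
    4,
    "Network Status: ...",
    "Steps 1-3: Action: Monitor/No action needed.",
)
print(prompt)
```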

##### Context output examples\.

The \{network\_status\} placeholder is rendered as a JSON list of all non\-baseline hosts, annotated with current status, recency, and action history:

```
Network Status: The following hosts have
updates or are in a non-baseline state:
[
  {"host_name": "Enterprise1",
   "current_status": "changed",
   "time_of_update": "Current",
   "applied_actions_so_far": "Analyse->Remove"},
  {"host_name": "User4",
   "current_status": "unknown",
   "time_of_update": "Past",
   "applied_actions_so_far": "Remove"}
]
```
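A rendering of this kind might be produced as in the sketch below. The input record layout and the function name `render_network_status` are hypothetical, the fallback sentence for the all\-baseline case uses assumed wording, and `json.dumps` indentation will not match the paper's hand\-wrapped layout exactly.

```python
import json


def render_network_status(hosts):
    """Render non-baseline hosts as a JSON list under a fixed header.

    hosts: list of dicts with host_name, current_status, time_of_update,
    and applied_actions (a list of action names). Baseline hosts are skipped.
    """
    changed = [h for h in hosts if h["current_status"] != "baseline"]
    if not changed:
        # Assumed wording; the source only specifies a single sentence.
        return "Network Status: All hosts are in their baseline state."
    entries = [
        {
            "host_name": h["host_name"],
            "current_status": h["current_status"],
            "time_of_update": h["time_of_update"],
            "applied_actions_so_far": "->".join(h["applied_actions"]),
        }
        for h in changed
    ]
    header = (
        "Network Status: The following hosts have updates "
        "or are in a non-baseline state:\n"
    )
    return header + json.dumps(entries, indent=2)


hosts = [
    {"host_name": "Enterprise1", "current_status": "changed",
     "time_of_update": "Current", "applied_actions": ["Analyse", "Remove"]},
    {"host_name": "User0", "current_status": "baseline",
     "time_of_update": "Past", "applied_actions": []},
]
print(render_network_status(hosts))
```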

When all hosts are healthy, a single sentence reports baseline status, keeping the prompt compact\. The \{history\} placeholder is rendered as a compressed action log with smart collapsing: consecutive quiet steps are folded into ranges when no state changes are detected, while steps involving interventions retain full detail:

```
Steps 1-3: Action: Monitor/No action needed.
  (No state changes observed)
Step 4: {"action": "Analyse Enterprise1",
  "analysis": "Host shows suspicious processes"}
Step 5: {"action": "Remove Enterprise1"}
Step 6: Action: Monitor
```
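The collapsing rule can be sketched as below. This is a minimal illustration under stated assumptions: the step record layout and the name `compress_history` are hypothetical, and a step is treated as quiet when its action is Monitor and no state change was flagged, which is our reading of the example above rather than a documented rule.

```python
import json


def compress_history(steps):
    """Fold runs of quiet steps into ranges; keep interventions in full.

    steps: list of dicts with 'step' (int), 'action' (str), and optional
    'analysis' (str) and 'state_changed' (bool).
    """
    lines, quiet = [], []

    def flush():
        # Emit the pending quiet run as a range, or as a single step.
        if len(quiet) > 1:
            lines.append(
                f"Steps {quiet[0]}-{quiet[-1]}: Action: Monitor/No action needed."
            )
            lines.append("  (No state changes observed)")
        elif quiet:
            lines.append(f"Step {quiet[0]}: Action: Monitor")
        quiet.clear()

    for s in steps:
        if s["action"] == "Monitor" and not s.get("state_changed", False):
            quiet.append(s["step"])
            continue
        flush()
        detail = {"action": s["action"]}
        if "analysis" in s:
            detail["analysis"] = s["analysis"]
        lines.append(f"Step {s['step']}: {json.dumps(detail)}")
    flush()
    return "\n".join(lines)


log = [
    {"step": 1, "action": "Monitor"},
    {"step": 2, "action": "Monitor"},
    {"step": 3, "action": "Monitor"},
    {"step": 4, "action": "Analyse Enterprise1",
     "analysis": "Host shows suspicious processes"},
    {"step": 5, "action": "Remove Enterprise1"},
    {"step": 6, "action": "Monitor"},
]
print(compress_history(log))
```

Folding only runs of two or more quiet steps keeps isolated Monitor steps visible, which matches the Step 6 line in the example.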

##### planner/persistent\_knowledge\.yaml\.

```
reflection_knowledge:
  - content:
    - Description: monitoring network for malicious
        activity.
      Example use: Monitor
      Name: Monitor
      Type: Passive observing action
    - Description: it does not remove the host but
        attempts to remove malicious infection from
        a host.
      Example use: Remove hostname=someName
      Name: Remove
      Type: Reactive intervention
    - Description: analyzing a host for malicious
        activity at deeper system level
      Example use: Analyse hostname=someName
      Name: Analyse
      Type: Passive observing action
    - Description: restoring a host to a clean state
        with very high penalty, this action removes
        all previously deployed decoys from the host
      Example use: Restore hostname=someName
      Name: Restore
      Type: Reactive intervention
    - Description: deploying a service to act as a
        decoy to a host to distract the attacker
        in future steps
      Example use: DecoySERVICE hostname=someName
      Name: DecoySERVICE
      Type: Proactive Protective action
    header: >
      INTERPRETATION OF ACTIONS FROM
      <AVAILABLE_ACTIONS_LIST>
    root: actions
    type: json
```

This is the only domain knowledge provided to the Planner \(and, identically, to the ActionChooser\): a five\-entry action\-type glossary with names, types, and usage syntax\. No tactical heuristics, no threat\-assessment rubrics, and no worked examples are included\.

##### Empty knowledge files\.

The following files are empty for all Planner experiments: examples\.yaml, reflection\_examples\.yaml, and reflection\_knowledge\.yaml\.

### B\.3\. Analyst

##### analyst/core\.yaml\.

```
agent_type: "ReAct"
include_tool_raise_a_question: false
include_tool_critique_the_answer: false
include_tool_improve_based_on_critique: false
include_COT_instruction: false

add_examples: false
add_reflection_examples: true

system_message: |
  You are a cybersecurity analyst.
  You MUST follow your <reflection_rules> during
  each thought phase.
  Provide a final answer to the question asked
  by the USER.

COT_instruction: |
  You MUST follow your Chain-of-Thought instructions
  during each thought phase.
  1. GET CURRENT STATE: Use get_host_current_state
     for the target host
  2. GET BASELINE: Use get_host_baseline_state to
     compare against initial state
  3. IDENTIFY ANOMALIES: What changed? New processes,
     connections, missing services?
  4. ASSESS SEVERITY: How critical is this compromise?
     Is there C2 activity?
  5. RECOMMEND ACTION: Should we contain, investigate
     further, or just monitor?

tools:
  - name: "get_host_current_state"
    description: "Get the current state details for a
      specific host. The input must be a single
      hostname."
    example_calling: "get_host_current_state:
      Enterprise1"

  - name: "get_host_baseline_state"
    description: "Get the baseline state details for
      a specific host. The input must be a single
      hostname."
    example_calling: "get_host_baseline_state:
      Enterprise1"

answer_format: |
  Your response MUST STRICTLY be a JSON array of
  objects that follows the following schema:
  {
    "host": "<n>",
    "status": "baseline|changed|unknown",
    "severity": "low|medium|high|critical",
    "anomalies": ["<short bullet>", "..."],
    "suspected_c2": ["<ip/host>"],
    "recommended_focus":
      "containment|investigation|monitor",
    "rationale": "<2-3 sentences>"
  }
```

The Analyst receives a one\-sentence role identity and two read\-only tools \(get\_host\_current\_state, get\_host\_baseline\_state\) that return deterministic JSON from the environment\-state layer\. Its output is a structured JSON assessment with a constrained schema\. No domain heuristics are provided: its persistent\_knowledge\.yaml and all example files are empty\.
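To make the constrained schema concrete, the sketch below checks an Analyst reply against the fields and enumerations listed in answer\_format. The validator `validate_assessment` is hypothetical; the artifact may enforce the schema differently, or only through the prompt.

```python
import json

# Enumerations copied from the answer_format block.
ALLOWED = {
    "status": {"baseline", "changed", "unknown"},
    "severity": {"low", "medium", "high", "critical"},
    "recommended_focus": {"containment", "investigation", "monitor"},
}
REQUIRED = {"host", "status", "severity", "anomalies",
            "suspected_c2", "recommended_focus", "rationale"}


def validate_assessment(raw: str) -> list:
    """Parse the Analyst's reply and check it against the schema."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of objects")
    for item in data:
        if not isinstance(item, dict):
            raise ValueError("each entry must be a JSON object")
        missing = REQUIRED - item.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        for key, allowed in ALLOWED.items():
            if item[key] not in allowed:
                raise ValueError(f"{key}={item[key]!r} is not allowed")
        for key in ("anomalies", "suspected_c2"):
            if not isinstance(item[key], list):
                raise ValueError(f"{key} must be a list")
    return data


reply = json.dumps([{
    "host": "Enterprise1",
    "status": "changed",
    "severity": "high",
    "anomalies": ["new process"],
    "suspected_c2": [],
    "recommended_focus": "containment",
    "rationale": "A new process appeared relative to baseline.",
}])
print(validate_assessment(reply)[0]["severity"])
```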

##### analyst/initial\_prompt\.yaml\.

```
prompt:
  opening: |
    You must provide an analysis for the host:
    {hostname}.
  closing: |
    What is your understanding of the situation
    with the host?
```

### B\.4\. ActionChooser

##### action\_chooser/core\.yaml\.

```
agent_type: "ReAct"
include_tool_raise_a_question: false
include_tool_critique_the_answer: false
include_tool_improve_based_on_critique: false
include_COT_instruction: false

add_examples: false
add_reflection_examples: true

system_message: |
  You are a tactical cybersecurity advisor for the
  Blue team.

  You MUST follow your <reflection_rules> during
  each thought phase.

  Provide a final answer to the question asked
  by the USER.

COT_instruction: |
  You MUST follow your Chain-of-Thought instructions
  during each thought phase.
  1. READ SITUATION: Check SITUATION_JSON for
     target_host, threat description, severity,
     and context
  2. EVALUATE ACTIONS: Consider available action
     types and their costs vs benefits
  3. RANK THREE: Provide three suggestions with
     confidence scores (0.0-1.0), highest confidence
     first

answer_format: |
  Your response MUST STRICTLY be a JSON array of
  objects, where each object represents a suggested
  action. Each object must have ONLY the following
  keys: "action", "confidence".
```

The ActionChooser has *no tools*; it is a pure generation agent that receives a situation JSON and returns a ranked action list\. Its persistent\_knowledge\.yaml contains the same five\-entry action glossary as the Planner \(reproduced above\)\. All example files are empty\.

##### action\_chooser/initial\_prompt\.yaml\.

```
prompt: |
  <SITUATION_JSON>
  {situational_understanding}
  </SITUATION_JSON>

  The JSON above contains:
  - "target_host": The specific hostname requiring
    action (CRITICAL: All your suggested actions
    MUST target this host)
  - "situation": Description of the current threat
  - "severity": The threat level
    (low|medium|high|critical)
  - "context": Additional relevant information
    from previous steps

  Provide THREE action suggestions for the
  target_host specified in the JSON, ordered from
  highest confidence to lowest.
```
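The output contract between the ActionChooser and the Planner can be sketched as follows: three suggestions with only the keys "action" and "confidence", ordered by descending confidence, and a Planner answer that is a verbatim copy of one suggested action string. Both function names are hypothetical, and the 0.0 to 1.0 confidence range is taken from the COT\_instruction block, so it may not be enforced when that block is disabled.

```python
import json


def parse_suggestions(raw: str) -> list:
    """Check the ActionChooser reply: three ranked {action, confidence} objects."""
    data = json.loads(raw)
    if not isinstance(data, list) or len(data) != 3:
        raise ValueError("expected a JSON array of three suggestions")
    for item in data:
        if not isinstance(item, dict) or set(item) != {"action", "confidence"}:
            raise ValueError("only 'action' and 'confidence' keys are allowed")
        if not 0.0 <= item["confidence"] <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")
    confs = [item["confidence"] for item in data]
    if confs != sorted(confs, reverse=True):
        raise ValueError("suggestions must be ordered by descending confidence")
    return data


def is_valid_final_answer(answer: str, suggestions: list) -> bool:
    """Planner rule: the final answer copies one suggested action verbatim."""
    return answer in {s["action"] for s in suggestions}


raw = json.dumps([
    {"action": "Restore hostname=Enterprise1", "confidence": 0.8},
    {"action": "Remove hostname=Enterprise1", "confidence": 0.5},
    {"action": "Analyse hostname=Enterprise1", "confidence": 0.2},
])
suggestions = parse_suggestions(raw)
print(is_valid_final_answer("Remove hostname=Enterprise1", suggestions))
```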

## Appendix C Deliberation Tool Schemas

The deliberation tools are three generic self\-critique operations implemented in the shared BaseToolExecutor class and inherited by all agent types\. They are toggled via boolean flags in each agent’s core\.yaml; the cumulative activation sequence defines the four deliberation levels in Axis 2 \(Section [3\.4](https://arxiv.org/html/2605.16205#S3.SS4)\)\.

##### Tool activation sequence\.

Table [5](https://arxiv.org/html/2605.16205#A3.T5) shows the cumulative activation of each flag\.

Table 5\. Deliberation tool activation by experimental level\. Each level cumulatively adds capabilities; \+COT adds all three tools plus an explicit chain\-of\-thought system prompt injection\.

##### raise\_a\_question\.

When invoked, the agent formulates a question directed at itself\. The tool returns the question text as an “observation” prompting the agent to answer it in the next reasoning turn\. This creates a self\-interrogation loop: the agent pauses its action\-selection process to surface uncertainties or alternative framings\. The tool accepts free\-text input \(the question\) and returns a formatted prompt: *“You have chosen ask yourself: \{question\}\. What can you answer to yourself?”*

##### critique\_the\_answer\.

The agent submits its current reasoning as input\. The tool returns this text as a critique prompt, asking the agent to evaluate its own logic: *“You have chosen to critique your reasoning: ‘\{reasoning\}’\. Now, provide your critical assessment\.”* This forces a second pass over the agent’s tentative conclusion before action commitment\.

##### improve\_based\_on\_critique\.

After critiquing, the agent submits an improved version of its reasoning\. The tool returns: *“You have decided to improve your answer based on the critique: ‘\{improved\_reasoning\}’\. Now, provide your final answer\.”* This creates a three\-phase deliberation cycle: question $\to$ critique $\to$ improve\.
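The three tools share one pattern: none of them calls a model or the environment; each wraps the agent's own text in a fixed prompt and returns it as the next observation. A minimal sketch of that pattern follows. It is not the BaseToolExecutor implementation; the function bodies are assumptions, and the exact whitespace and quote characters of the prompt strings are approximated from the quotations above.

```python
def raise_a_question(question: str) -> str:
    """Echo the agent's question back as a prompt to answer."""
    return (f"You have chosen ask yourself: {question}. "
            "What can you answer to yourself?")


def critique_the_answer(reasoning: str) -> str:
    """Echo the agent's reasoning back as a critique prompt."""
    return (f"You have chosen to critique your reasoning: '{reasoning}'. "
            "Now, provide your critical assessment.")


def improve_based_on_critique(improved_reasoning: str) -> str:
    """Echo the improved reasoning back and ask for the final answer."""
    return ("You have decided to improve your answer based on the critique: "
            f"'{improved_reasoning}'. Now, provide your final answer.")


# Hypothetical registry keyed by tool name, in cycle order.
DELIBERATION_TOOLS = {
    "raise_a_question": raise_a_question,
    "critique_the_answer": critique_the_answer,
    "improve_based_on_critique": improve_based_on_critique,
}

observation = DELIBERATION_TOOLS["critique_the_answer"]("Restore is too costly here")
print(observation)
```

Because each call adds another tool exchange to the conversation without adding new information from the environment, the cost of these tools shows up mainly as extra prompt tokens.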

##### COT instruction injection\.

The \+COT level does not add a fourth tool\. Instead, it sets include\_COT\_instruction=true, which injects the COT\_instruction block from the agent’s core\.yaml \(shown in Appendix [B](https://arxiv.org/html/2605.16205#A2)\) into the system prompt\. This provides explicit step\-by\-step reasoning guidance tailored to each agent’s role\. Combined with the three tools, \+COT represents the maximum deliberation configuration\.

##### Scope of distribution\.

In hier\-base, only the Planner’s deliberation flags are toggled \(the Analyst and ActionChooser retain false for all flags\)\. In hier\-delib, the same flags are toggled for *all three agents*, creating the distributed deliberation condition studied in Section [5\.2\.3](https://arxiv.org/html/2605.16205#S5.SS2.SSS3)\.
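The cumulative activation and its scope across agents can be summarized in a short sketch. The integer level encoding, the function names, and the assumed activation order (question, then critique, then improve) are ours; the source names only the flags, the \+question and \+COT levels, and the scope rule stated above. Which level each hierarchy configuration actually uses is not stated here, so it is left as a parameter.

```python
TOOL_FLAGS = [
    "include_tool_raise_a_question",
    "include_tool_critique_the_answer",
    "include_tool_improve_based_on_critique",
]
AGENTS = ("planner", "analyst", "action_chooser")


def flags_for_level(level: int) -> dict:
    """Level 0 is the anchor; 1-3 add one tool each; 4 adds the COT block."""
    if not 0 <= level <= 4:
        raise ValueError("level must be between 0 and 4")
    flags = {name: i < level for i, name in enumerate(TOOL_FLAGS)}
    flags["include_COT_instruction"] = level == 4
    return flags


def hierarchy_flags(config: str, level: int) -> dict:
    """hier-base toggles only the Planner; hier-delib toggles all agents."""
    if config not in ("hier-base", "hier-delib"):
        raise ValueError(f"unknown configuration: {config}")
    return {
        agent: flags_for_level(
            level if config == "hier-delib" or agent == "planner" else 0
        )
        for agent in AGENTS
    }


print(hierarchy_flags("hier-delib", 3)["analyst"])
```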

## Appendix D Complete Results

This appendix presents supplementary results for all 72 model–configuration pairs, organized by experimental axis\. Figure [6](https://arxiv.org/html/2605.16205#A4.F6) compares each axis’s best configuration against the shared anchor\.

![Grouped bar chart comparing the anchor configuration return against the best configuration on each axis for all six models\.](https://arxiv.org/html/2605.16205v1/fig05_headline_overview.png)

Figure 6\. Best configuration per axis compared to the shared anchor configuration\. For most models, the largest absolute improvement comes from hierarchy \(Axis 3\), but context engineering \(Axis 1\) achieves competitive gains at a fraction of the token cost\.

### D\.1\. Context Engineering

Table [6](https://arxiv.org/html/2605.16205#A4.T6) provides standard deviations complementing the main\-text hero table \(Table [3](https://arxiv.org/html/2605.16205#S5.T3)\)\. Table [7](https://arxiv.org/html/2605.16205#A4.T7) reports the marginal value of adding or removing individual context components\. Figures [7](https://arxiv.org/html/2605.16205#A4.F7) and [8](https://arxiv.org/html/2605.16205#A4.F8) visualize context component marginal gains and the raw\-observation penalty\. Figures [9](https://arxiv.org/html/2605.16205#A4.F9) and [10](https://arxiv.org/html/2605.16205#A4.F10) show context component interactions\.

![Bar chart of marginal improvement from adding each context component, grouped by model\.](https://arxiv.org/html/2605.16205v1/fig09_marginal_value.png)

Figure 7\. Marginal value of adding individual context components\. Positive values indicate improvement\. Adding \{network\_status\} to raw observation delivers the largest consistent gains\.

![Horizontal bar chart comparing obs\-only and hist\+net returns for each model\.](https://arxiv.org/html/2605.16205v1/fig11_raw_obs_penalty.png)

Figure 8\. Raw observation penalty\. Gap between obs\-only and the structured hist\+net anchor configuration per model\. Longer bars indicate larger benefit from replacing raw observations with programmatic context\.

Table 6\. Full results with standard deviation\. Mean episode return \($\pm$ standard deviation\) across all 72 model–configuration pairs, complementing the token\-cost data in Table [3](https://arxiv.org/html/2605.16205#S5.T3)\. Configurations are grouped by experimental axis\.

Table 7\. Context component marginal value\. Each row shows the percentage improvement from adding or removing one context component\. Positive values indicate improvement \(return moves toward zero\)\. Transitions with $\geq$ 30% improvement are bolded\.

![Context component waterfall per model\.](https://arxiv.org/html/2605.16205v1/fig22_context_waterfall.png)

Figure 9\. Context component waterfall\. Additive effect of history and network status on top of raw observation\. Green = improvement, red = degradation\.

![Context component synergy analysis\.](https://arxiv.org/html/2605.16205v1/fig10_context_interaction.png)

Figure 10\. Context component interaction\. obs\+net compared to obs\-only and network\-only\. Synergy \($\Delta$\) shows whether combining exceeds the better individual\.

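Because returns are non\-positive, a percentage improvement of the kind reported in Table 7 is naturally read as movement toward zero relative to the magnitude of the starting return. The sketch below shows one such definition; the formula is an assumption on our part, since the caption does not state it, and the numbers in the usage line are illustrative rather than taken from the paper's tables.

```python
def pct_improvement(old_return: float, new_return: float) -> float:
    """Percentage change toward zero, relative to the old return's magnitude.

    Assumed definition: 100 * (new - old) / |old|. Positive means the
    return moved toward zero (improvement) when returns are non-positive.
    """
    if old_return == 0:
        raise ValueError("old_return must be non-zero")
    return 100.0 * (new_return - old_return) / abs(old_return)


# Illustrative values only: a return of -100 improving to -24.
print(pct_improvement(-100.0, -24.0))
```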
### D\.2\. Hierarchy and Architecture Summary

Table [8](https://arxiv.org/html/2605.16205#A4.T8) details the hierarchy degradation ratios\. Table [9](https://arxiv.org/html/2605.16205#A4.T9) compares deliberation across monolithic and hierarchical settings\. Table [10](https://arxiv.org/html/2605.16205#A4.T10) quantifies the architectural impact range per model\. Figure [11](https://arxiv.org/html/2605.16205#A4.F11) shows model fingerprints across axes\. Figure [12](https://arxiv.org/html/2605.16205#A4.F12) visualizes the deliberation cascade penalty and Figure [13](https://arxiv.org/html/2605.16205#A4.F13) shows the hierarchy degradation pattern\.

Table 8\. Hierarchy results\. Mean return, standard deviation, and tokens per episode for both hierarchy configurations, plus the degradation ratio \(hier\-delib/hier\-base; values $> 1$ indicate degradation from adding distributed deliberation\)\.

Table 9\. Deliberation in monolithic vs\. hierarchical settings\. Compares the shared anchor configuration, best monolithic deliberation level, and both hierarchy configurations\. For most models, hier\-base matches or exceeds the best monolithic deliberation at comparable cost, while hier\-delib degrades it\.

Table 10\. Architectural impact range per model\. Best and worst configurations across all twelve, with the return gap quantifying the maximum leverage of architectural choices within each model family\.

![Six radar chart subplots, one per model\. Each has three spokes for context, deliberation, and hierarchy\. Filled areas show normalized best scores\. Grok and G3FP have the largest areas; G2\.5FL the smallest\.](https://arxiv.org/html/2605.16205v1/fig06_radar.png)

Figure 11\. Model fingerprints\. Each radar shows normalized best performance on three axes \(context, deliberation, hierarchy\)\. Larger area indicates better overall performance\. Models exhibit distinct capability profiles: Grok and G3FP excel across axes, while G2\.5FL is uniformly weak\.

![Deliberation cascade penalty slope chart\.](https://arxiv.org/html/2605.16205v1/fig21_metacognition_penalty.png)

Figure 12\. The Deliberation Cascade Penalty\. Slope chart showing the shift in mean return when moving from monolithic deliberation to hierarchical deliberation\. Red = degradation\.

![Hierarchy degradation visualization comparing hier\-base and hier\-delib performance across models\.](https://arxiv.org/html/2605.16205v1/fig15_hier_degradation.png)

Figure 13\. Hierarchy degradation\. Performance change when distributing deliberation tools across the hierarchy, showing model\-specific sensitivity to the deliberation cascade\.

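The degradation ratio in Table 8 divides one negative mean return by another, so a value above 1 means hier\-delib is further from zero than hier\-base. A minimal sketch, with a hypothetical function name and illustrative inputs that are not taken from the paper's tables:

```python
def degradation_ratio(hier_base_mean: float, hier_delib_mean: float) -> float:
    """Ratio hier-delib / hier-base of mean returns (both non-positive).

    Values above 1 mean distributed deliberation made the return worse.
    """
    if hier_base_mean >= 0:
        raise ValueError("hier_base_mean must be negative for the ratio to be meaningful")
    return hier_delib_mean / hier_base_mean


# Illustrative values only: -20 for hier-base, -68 for hier-delib.
print(round(degradation_ratio(-20.0, -68.0), 2))
```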
### D\.3\. Deliberation

Tables [11](https://arxiv.org/html/2605.16205#A4.T11) and [12](https://arxiv.org/html/2605.16205#A4.T12) break down the deliberation axis with per\-level token consumption\. Table [13](https://arxiv.org/html/2605.16205#A4.T13) shows the return\-on\-investment for each deliberation level\. Figures [14](https://arxiv.org/html/2605.16205#A4.F14), [15](https://arxiv.org/html/2605.16205#A4.F15), and [16](https://arxiv.org/html/2605.16205#A4.F16) visualize ROI, capability correlation, and the reasoning ceiling effect\. Figure [17](https://arxiv.org/html/2605.16205#A4.F17) shows the performance trajectory across cumulative deliberation levels\.

Table 11\. Deliberation results \(Grok, Llama, Devstral\)\. Mean episode return, standard deviation, and tokens per episode across five deliberation levels\. Best return per model is bolded\.

Table 12\. Deliberation results \(Qwen, G2\.5FL, G3FP\)\. Mean episode return, standard deviation, and tokens per episode across five deliberation levels\. Best return per model is bolded\.

Table 13\. Deliberation return on investment\. Shows the reward change \($\Delta$\) and token increase \($\Delta$Tok\) relative to the planner\-only anchor for each reasoning level\. Positive $\Delta$ = improvement\.

![Deliberation ROI scatter plot\.](https://arxiv.org/html/2605.16205v1/fig12_sr_roi_scatter.png)

Figure 14\. Deliberation ROI\. Each point shows one model–level pair’s change in return and tokens relative to anchor\. Upper\-left = efficient improvement\.

![Baseline capability vs deliberation benefit\.](https://arxiv.org/html/2605.16205v1/fig13_capability_correlation.png)

Figure 15\. Capability correlation\. Models with stronger baselines \(right\) benefit less from deliberation on average\. Spearman correlation shown\.

![Reasoning ceiling effect\.](https://arxiv.org/html/2605.16205v1/fig23_reasoning_ceiling.png)

Figure 16\. The reasoning ceiling\. Models with higher baseline capabilities tend to peak at lower levels of deliberation before degrading\.

![Deliberation progression across cumulative levels for all six models\.](https://arxiv.org/html/2605.16205v1/fig02_reasoning_progression.png)

Figure 17\. Deliberation progression\. Performance trajectory across cumulative deliberation levels for each model, showing non\-monotonic patterns and model\-dependent ceilings\.

### D\.4\. Cross\-Axis Comparisons

Table [14](https://arxiv.org/html/2605.16205#A4.T14) compares the observation\-only baseline against each model’s best and worst configuration\. Tables [17](https://arxiv.org/html/2605.16205#A4.T17), [18](https://arxiv.org/html/2605.16205#A4.T18), and [19](https://arxiv.org/html/2605.16205#A4.T19) identify best and worst configurations within each axis\. Table [15](https://arxiv.org/html/2605.16205#A4.T15) and Figure [18](https://arxiv.org/html/2605.16205#A4.F18) present pairwise win rates; Figure [19](https://arxiv.org/html/2605.16205#A4.F19) provides a head\-to\-head win\-rate matrix using each model’s peak configuration\. Table [16](https://arxiv.org/html/2605.16205#A4.T16) summarizes performance by configuration group\. Figure [20](https://arxiv.org/html/2605.16205#A4.F20) and Table [20](https://arxiv.org/html/2605.16205#A4.T20) demonstrate ranking stability across axes\. Table [21](https://arxiv.org/html/2605.16205#A4.T21) reports anchor\-configuration performance\.

Table 14\. Baseline Performance \(Planner \+ Observation Only\)\. The simplest configuration compared against each model’s best and worst overall configuration across all 12 options\.

Table 15\. Pairwise Win Rates: Anchor vs\. Full Deliberation \(\+COT\)\.

![Pairwise win rates visualization\.](https://arxiv.org/html/2605.16205v1/fig20_win_rates.png)

Figure 18\. Pairwise win rates\. Anchor vs \+COT compared instance\-by\-instance\. Blue = anchor wins, orange = \+COT wins, gray = ties\.

![Global win\-rate matrix heatmap\.](https://arxiv.org/html/2605.16205v1/fig31_global_win_matrix.png)

Figure 19\. Global head\-to\-head win\-rate matrix\. Compares the peak configuration of every model against every other model on a per\-instance basis\.

Table 16\. Performance by config group\. Mean and range of mean episode return across all models for each configuration group\.

Table 17\. Context Engineering: Best and Worst Configurations per model\.

Table 18\. Deliberation: Best and Worst Configurations per model\.

Table 19\. Hierarchy: Best and Worst Configurations per model\.

![Model ranking stability across axes\.](https://arxiv.org/html/2605.16205v1/fig19_ranking_stability.png)

Figure 20\. Model ranking stability\. Lines connect each model’s best\-config rank across the three axes\. Flat lines = consistent relative performance\.

Table 20\. Model rankings by config group \(1 = best\)\.

Table 21\. Anchor performance across all models\.

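An instance\-by\-instance win rate of the kind shown in Table 15 and Figure 18 can be computed as in the sketch below. Aggregating the runs of each instance by their mean before comparing is an assumption of this sketch; the source states only that the comparison is made per instance. The function name and the data in the usage example are hypothetical.

```python
from statistics import mean


def pairwise_win_rates(returns_a, returns_b):
    """Compare two configurations instance by instance.

    returns_*: dict mapping instance id -> list of episode returns.
    Returns the fraction of shared instances won by A, tied, and won by B.
    """
    shared = sorted(set(returns_a) & set(returns_b))
    if not shared:
        raise ValueError("no shared instances to compare")
    wins = ties = 0
    for inst in shared:
        a, b = mean(returns_a[inst]), mean(returns_b[inst])
        wins += a > b   # higher (closer to zero) return wins
        ties += a == b
    n = len(shared)
    return {"a_wins": wins / n, "ties": ties / n, "b_wins": (n - wins - ties) / n}


anchor = {0: [-10.0, -20.0], 1: [-50.0, -50.0], 2: [-5.0, -5.0], 3: [-80.0, -60.0]}
cot = {0: [-30.0, -30.0], 1: [-40.0, -40.0], 2: [-5.0, -5.0], 3: [-90.0, -70.0]}
print(pairwise_win_rates(anchor, cot))
```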
### D\.5\. Distributional Analysis

Table [22](https://arxiv.org/html/2605.16205#A4.T22) reports standard deviation and worst\-case \(minimum\) episode return for every model–configuration pair, supporting the robustness analysis in Section [5\.4](https://arxiv.org/html/2605.16205#S5.SS4)\. Table [23](https://arxiv.org/html/2605.16205#A4.T23) reports catastrophic failure rates \(return $< -150$\) for each configuration\. Figures [21](https://arxiv.org/html/2605.16205#A4.F21), [22](https://arxiv.org/html/2605.16205#A4.F22), [23](https://arxiv.org/html/2605.16205#A4.F23), [24](https://arxiv.org/html/2605.16205#A4.F24), and [25](https://arxiv.org/html/2605.16205#A4.F25) provide distributional views of episode returns across design axes\.

Table 22\. Variance and tail risk\. Standard deviation and minimum \(worst\-case\) episode return for each model–configuration pair\.

Table 23\. Catastrophic failure rates\. Percentage of episodes with return below $-150$ \(indicating near\-total network compromise\)\. Lower is better\.

![CDF by design axis\.](https://arxiv.org/html/2605.16205v1/fig16_score_cdf.png)

Figure 21\. Cumulative distributions by axis\. Threshold lines mark failure severity\. Right\-shifted curves indicate better tail behavior\.

![Score distributions by design axis\.](https://arxiv.org/html/2605.16205v1/fig08_group_boxplot.png)

Figure 22\. Score distributions by design axis\. Violins show full distribution with mean and median\. Context configs have the widest spread\.

![Density ridges anchor vs COT\.](https://arxiv.org/html/2605.16205v1/fig25_score_distribution_ridges.png)

Figure 23\. Score distribution shift\. Density ridges comparing the continuous probability distribution of episode returns for the Anchor baseline \(blue\) versus \+COT \(orange\)\.

![Risk\-reward frontier\.](https://arxiv.org/html/2605.16205v1/fig24_risk_reward.png)

Figure 24\. Risk vs\. Reward stability frontier\. Shifts show how adding \+COT changes both the mean return and variance\. Ideally, arrows move up and to the left\.

![Outcome breakdown per model for \+COT\.](https://arxiv.org/html/2605.16205v1/fig26_outcome_breakdown.png)

Figure 25\. Outcome breakdown for the \+COT configuration\. Shows the percentage of episodes resulting in success, mediocre failure, or catastrophic failure\.

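The tail\-risk quantities in Tables 22 and 23 can be computed from a list of episode returns as sketched below. The $-150$ threshold comes from the text; the use of the sample standard deviation and the function name `tail_risk` are assumptions of this sketch, and the returns in the usage line are illustrative.

```python
from statistics import stdev


def tail_risk(returns, threshold=-150.0):
    """Standard deviation, worst-case return, and catastrophic failure rate.

    The failure rate is the percentage of episodes with return strictly
    below the threshold. Sample standard deviation is an assumption.
    """
    if len(returns) < 2:
        raise ValueError("need at least two episodes")
    failures = sum(r < threshold for r in returns)
    return {
        "std": stdev(returns),
        "min": min(returns),
        "catastrophic_rate_pct": 100.0 * failures / len(returns),
    }


# Illustrative returns only.
print(tail_risk([-10.0, -200.0, -30.0, -160.0]))
```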
### D\.6\. Token Cost Progression

Figure [26](https://arxiv.org/html/2605.16205#A4.F26) shows the token cost progression from cheapest \(obs\) to most expensive \(hier\-delib\) configuration\.

![Bar chart showing token cost per episode from cheapest to most expensive configuration type\.](https://arxiv.org/html/2605.16205v1/fig07_token_waterfall.png)

Figure 26\. Token cost progression from cheapest \(obs\) to most expensive \(hier\-delib\)\. Deliberation and hierarchy dramatically increase token consumption; the deliberation cascade represents the cost ceiling\.

## Appendix E Statistical Support

This appendix provides 95% confidence intervals for mean episode returns \(Table [24](https://arxiv.org/html/2605.16205#A5.T24)\) and paired mean\-return differences for key comparisons \(Table [25](https://arxiv.org/html/2605.16205#A5.T25)\)\.

Table 24\. Full results matrix with 95% confidence intervals\. Mean episode return \($\pm$ CI half\-width\) across all 72 model–configuration pairs\. Best point\-estimate return per model is bolded\. Configurations are grouped by experimental axis\.

Table 25\. Paired mean\-return differences with 95% confidence intervals\. Each cell shows $\Delta$ = mean paired difference $\pm$ CI half\-width \(matched by instance $\times$ run; duplicate episodes averaged before pairing\)\. Bold indicates the 95% CI excludes zero; positive $\Delta$ means the first\-named configuration yields higher return\. † Post\-hoc selected: best\-ctx is the highest\-return structured context per model \(excluding raw obs\); best\-SR is the highest\-return monolithic self\-reasoning level \(\+question…\+COT, excluding the hist\+net anchor\)\.

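The pairing procedure described for Table 25 (match episodes by instance and run, average duplicates, then take the mean difference with a confidence half\-width) can be sketched as follows. The interval method is not stated in this appendix; the normal approximation used here is an assumption, and the paper may use a t\-interval or a bootstrap instead. The function name and episode tuples are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_difference_ci(episodes_a, episodes_b, level=0.95):
    """Mean paired difference (A minus B) and CI half-width.

    episodes_*: iterable of (instance, run, episode_return) tuples.
    Duplicate (instance, run) episodes are averaged before pairing.
    """
    def average_duplicates(episodes):
        buckets = defaultdict(list)
        for instance, run, ret in episodes:
            buckets[(instance, run)].append(ret)
        return {key: mean(vals) for key, vals in buckets.items()}

    a, b = average_duplicates(episodes_a), average_duplicates(episodes_b)
    keys = sorted(set(a) & set(b))
    if len(keys) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a[k] - b[k] for k in keys]
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation (assumed)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return mean(diffs), half_width


eps_a = [(0, 0, -10.0), (0, 0, -20.0), (0, 1, -30.0), (1, 0, -40.0)]
eps_b = [(0, 0, -25.0), (0, 1, -45.0), (1, 0, -60.0)]
delta, half = paired_difference_ci(eps_a, eps_b)
print(delta, half)
```

A positive difference here means the first configuration has the higher (less negative) return, matching the sign convention in the table caption.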
## Appendix FToken Consumption

This appendix details per-model token profiles and token efficiency across configurations. Tables [26](https://arxiv.org/html/2605.16205#A6.T26) through [31](https://arxiv.org/html/2605.16205#A6.T31) report a scalar *shifted return-per-kilotoken* efficiency, $\tilde{G}/\mathrm{KTok}$, where $\tilde{G} = 225 + G$ converts the non-positive episodic return $G$ into a non-negative "defense score" (higher is better), and $\mathrm{KTok}$ is tokens per episode in thousands. This scalar is provided as a compact summary; our primary cost–performance comparisons use Pareto frontiers in the main text. We also performed a pricing sensitivity check by re-weighting token costs using provider-specific input/output pricing ratios. This re-weighting did not reverse any qualitative conclusion; it narrowed the relative cost advantage of context over hierarchy because simpler context configurations contain a higher share of output tokens, but the qualitative ordering was preserved.
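As a concrete reading of the definition above, the following sketch computes $\tilde{G}/\mathrm{KTok}$ for a single configuration, together with one possible form of the price re-weighting. The re-weighting formula (prompt tokens plus completion tokens scaled by an output-to-input price ratio) is an assumption; the paper does not give its exact form. Function names and example numbers are hypothetical.

```python
SHIFT = 225.0  # offset from the definition G~ = 225 + G


def shifted_return(g):
    """Convert a non-positive episodic return G into the defense score G~."""
    if g > 0:
        raise ValueError("episodic return is expected to be non-positive")
    return SHIFT + g


def efficiency_per_ktok(g, tokens_per_episode):
    """Shifted return per thousand tokens (higher is better)."""
    if tokens_per_episode <= 0:
        raise ValueError("tokens_per_episode must be positive")
    return shifted_return(g) / (tokens_per_episode / 1000.0)


def weighted_token_cost(prompt_tokens, completion_tokens, output_to_input_ratio):
    """Token cost in input-token equivalents (assumed re-weighting form)."""
    return prompt_tokens + output_to_input_ratio * completion_tokens


# Illustrative numbers only; not taken from Tables 26-32.
print(efficiency_per_ktok(-25.0, 50_000))        # (225 - 25) / 50 KTok
print(weighted_token_cost(40_000, 2_000, 4.0))   # output priced 4x input
```

Because the shift is a fixed constant, this scalar ranks configurations differently from a raw return-per-token ratio, which is one reason to treat it as a summary rather than a replacement for the Pareto analysis.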

Table [32](https://arxiv.org/html/2605.16205#A6.T32) breaks down the prompt vs. completion token split for the anchor baseline and the +COT configuration. Deliberation dramatically increases prompt tokens (due to multi-turn tool-call exchanges) and moderately increases completion tokens. The prompt-to-completion ratio shifts from roughly 22–24$\times$ at baseline to 44–31$\times$ under +COT, indicating that deliberation overhead is dominated by the expansion of the conversational context rather than by longer model outputs.

Figure [27](https://arxiv.org/html/2605.16205#A6.F27) visualizes the prompt/completion breakdown. Figure [28](https://arxiv.org/html/2605.16205#A6.F28) shows the exponential increase in token consumption as deliberation levels are added. Figure [29](https://arxiv.org/html/2605.16205#A6.F29) plots the token-cost multiplier of +COT against its per-instance win rate over the anchor: models where +COT achieves high win rates (Grok, G2.5FL) pay 4–5$\times$ more tokens, while models where +COT is harmful (Llama) pay a similar multiplier for worse outcomes.
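The two axes of the cost vs. win rate plot can be sketched as below. The definition of a per-instance "win" used here (the +COT mean return on an instance strictly exceeds the anchor mean return on the same instance) is an assumption, since this appendix does not spell it out. All names and numbers are hypothetical.

```python
from statistics import mean


def token_multiplier(anchor_tokens, cot_tokens):
    """Ratio of mean tokens per episode: +COT relative to the anchor."""
    return mean(cot_tokens) / mean(anchor_tokens)


def per_instance_win_rate(anchor_returns, cot_returns):
    """Fraction of shared instances where +COT beats the anchor on mean return.

    Both arguments map an instance id to a list of episode returns.
    """
    shared = sorted(set(anchor_returns) & set(cot_returns))
    if not shared:
        raise ValueError("no shared instances to compare")
    wins = sum(
        1 for i in shared if mean(cot_returns[i]) > mean(anchor_returns[i])
    )
    return wins / len(shared)


# Illustrative values only; not data from the study.
anchor = {0: [-50.0, -60.0], 1: [-20.0, -30.0], 2: [-100.0, -80.0]}
cot = {0: [-40.0, -30.0], 1: [-35.0, -45.0], 2: [-60.0, -70.0]}
multiplier = token_multiplier([20_000, 22_000], [90_000, 99_000])
win_rate = per_instance_win_rate(anchor, cot)
print(f"multiplier = {multiplier:.1f}x, win rate = {win_rate:.2f}")
```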

Table 26. Token efficiency: Grok. Shifted return-per-KToken ($\tilde{G}/\mathrm{KTok}$, higher is better), where $\tilde{G} = 225 + G$ converts the non-positive return $G$ into a non-negative defense score. The most efficient configuration by this scalar is bolded (Pareto efficiency is analyzed in the main text).

Table 27. Token efficiency: Llama. Shifted return-per-KToken ($\tilde{G}/\mathrm{KTok}$, higher is better), where $\tilde{G} = 225 + G$ converts the non-positive return $G$ into a non-negative defense score. The most efficient configuration by this scalar is bolded (Pareto efficiency is analyzed in the main text).

Table 28. Token efficiency: Devstral. Shifted return-per-KToken ($\tilde{G}/\mathrm{KTok}$, higher is better), where $\tilde{G} = 225 + G$ converts the non-positive return $G$ into a non-negative defense score. The most efficient configuration by this scalar is bolded (Pareto efficiency is analyzed in the main text).

Table 29. Token efficiency: Qwen. Shifted return-per-KToken ($\tilde{G}/\mathrm{KTok}$, higher is better), where $\tilde{G} = 225 + G$ converts the non-positive return $G$ into a non-negative defense score. The most efficient configuration by this scalar is bolded (Pareto efficiency is analyzed in the main text).

Table 30. Token efficiency: G2.5FL. Shifted return-per-KToken ($\tilde{G}/\mathrm{KTok}$, higher is better), where $\tilde{G} = 225 + G$ converts the non-positive return $G$ into a non-negative defense score. The most efficient configuration by this scalar is bolded (Pareto efficiency is analyzed in the main text).

Table 31. Token efficiency: G3FP. Shifted return-per-KToken ($\tilde{G}/\mathrm{KTok}$, higher is better), where $\tilde{G} = 225 + G$ converts the non-positive return $G$ into a non-negative defense score. The most efficient configuration by this scalar is bolded (Pareto efficiency is analyzed in the main text).

Table 32. Token Profile Shift. Average prompt and completion tokens per episode for the Anchor baseline vs. +COT.

![Token profile shift visualization.](https://arxiv.org/html/2605.16205v1/fig14_token_profile.png)

Figure 27. Token profile shift. Stacked bars show prompt (solid) vs completion (hatched) for anchor and +COT. Deliberation increases both components.

![Token velocity across reasoning levels.](https://arxiv.org/html/2605.16205v1/fig27_token_velocity.png)

Figure 28. Token generation velocity. Shows the exponential increase in token usage as cumulative deliberation capabilities are added.

![Cost vs win rate scatter.](https://arxiv.org/html/2605.16205v1/fig29_cost_vs_winrate.png)

Figure 29. Cost vs. Win Rate. Plots the token cost multiplier of using +COT against the resulting win rate against the Anchor baseline.

## Appendix G Trajectory Examples

We present a paired trajectory comparison from Devstral on instance 7, illustrating the deliberation cascade failure mode described in Section [5.2.3](https://arxiv.org/html/2605.16205#S5.SS2.SSS3). Both episodes use identical model weights, environment seed, and hist+net context. The only difference is whether deliberation tools are distributed across the hierarchy.

##### Episode identification\.

The hier-base episode achieves a return of $-13.3$ (near-optimal defense); the hier-delib episode achieves $-211.2$ (near-total network loss), a $15.9\times$ degradation on the same environment instance. Table [33](https://arxiv.org/html/2605.16205#A7.T33) summarizes the behavioral contrast.

Table 33. Behavioral comparison: Devstral instance 7, hier-base vs. hier-delib. The only architectural difference is the distribution of deliberation tools across the hierarchy.

| Metric | hier-base | hier-delib |
| --- | --- | --- |
| Episode return | $-13.3$ | $-211.2$ |
| Restore actions | 12 | 1 |
| Remove actions | 5 | 0 |
| Decoy deployments | 0 | 13 |
| Monitor actions | 3 | 7 |
| Delib. tool calls (Planner) | 0 | 11 |
| AC msgs / invocation | 2.0 | 6.6 |
| Analyst invocations | 14 | 5 |
| Steps at $\leq -13$/step | 0 | 16 |
| First Restore action | step 4 | step 25 |

##### Mechanism: distributed caution bias\.

The trajectories reveal a failure mode we term *distributed caution bias*, a concrete instantiation of the cascading-uncertainty mechanism discussed in Section [5.2.3](https://arxiv.org/html/2605.16205#S5.SS2.SSS3). When each agent independently runs a deliberation cycle (raise_a_question $\to$ critique $\to$ improve), each cycle independently converges on the conservative option: investigate further, deploy passive defenses, defer reactive intervention. No individual agent’s reasoning is incorrect: the Analyst’s hedge is reasonable, the ActionChooser’s caution is defensible, and the Planner’s self-questioning is thoughtful. But the *composition* of three independently cautious agents produces a system that never intervenes until catastrophic damage has accumulated. The failure is structural: the system lacks an urgency-propagation mechanism that would override local caution when system-level evidence demands immediate action.
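To make the notion of urgency propagation concrete, the sketch below shows one hypothetical form such an override could take. The study does not specify, implement, or evaluate any such mechanism, so this is purely illustrative. The function `apply_urgency_override`, the `PASSIVE_ACTIONS` grouping, and the `patience` parameter are all assumptions; the $-13$ threshold borrows the per-step loss level reported in Table 33.

```python
# Hypothetical grouping of non-interventionist actions.
PASSIVE_ACTIONS = {"Monitor", "Analyse", "Decoy"}


def apply_urgency_override(
    recommended,
    recent_rewards,
    loss_threshold=-13.0,  # per-step loss level borrowed from Table 33
    patience=2,            # hypothetical: steps of sustained loss tolerated
    fallback="Restore",
):
    """Replace a passive recommendation when heavy losses are sustained.

    `recent_rewards` holds per-step rewards, most recent last.
    """
    sustained_loss = len(recent_rewards) >= patience and all(
        r <= loss_threshold for r in recent_rewards[-patience:]
    )
    if sustained_loss and recommended in PASSIVE_ACTIONS:
        return fallback
    return recommended


# Sustained -13.1/step losses: a passive "Monitor" would be overridden.
print(apply_urgency_override("Monitor", [-13.1, -13.1, -13.1]))
# Clean network: a passive "Analyse" passes through unchanged.
print(apply_urgency_override("Analyse", [0.0, 0.0]))
```

The point of the sketch is that the override reads a system-level signal (the reward stream) rather than any single agent's local reasoning, which is the kind of signal the composed hierarchy has no channel for.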

##### Showcase: Step 18 \(hier\-delib\)\.

At step 18, the network has been incurring $-13.1$ reward per step since step 15 (Op_Server0 is compromised). The Planner’s reasoning trace proceeds as follows. (Excerpts are lightly trimmed for space.)

```
[Step 18, hier-delib, reward: -13.1]

Planner (thought): User4 has been analyzed twice
  and has decoys deployed. The other hosts have
  similar setups. The next strategic step is to
  ensure all hosts are consistently protected...
Planner -> raise_a_question: "Should I focus on
  monitoring the network or re-analyzing User4?"
Planner (thought): I need to critique my own
  reasoning to ensure I’m making the best decision.
Planner -> critique_the_answer: "My initial thought
  is to either monitor or re-analyze User4.
  However, I should consider whether there are
  other hosts that might need attention..."
Planner -> improve_based_on_critique: "Monitoring
  the network is the more strategic action. It
  will provide a broader view of potential threats
  and ensure all hosts are observed for anomalies."
Planner -> get_suggestion_for_next_action:
  {"target_host": "User4", "severity": "low", ...}
  ActionChooser (8 msgs, own deliberation cycle):
    -> raise_a_question: "What would be the
       appropriate strategy for User4?"
    -> critique: "Have I considered the potential
       need for reactive interventions?"
    -> improve: "Balance passive and proactive..."
    Returns: [Monitor (high), Analyse (medium),
              Restore (low)]
Planner (final): Answer: Monitor
```

The Planner uses three deliberation tools before consulting the ActionChooser. The ActionChooser runs its own three-phase deliberation cycle. Ten messages are exchanged to arrive at "Monitor", a passive action, while the network loses 13.1 reward points at this step. The Analyst is not consulted at all at this step.

##### Contrast: Step 18 \(hier\-base\)\.

In the same episode phase, the hier-base agent operates with a clean network (all hosts at baseline, reward $= 0.0$) because it contained threats via timely Restore actions at steps 4, 6, 8, 11, and 14. Its step-18 trace is four messages:

```
[Step 18, hier-base, reward: 0.0]

Planner (thought): Network is stable, all hosts in
  baseline state. Enterprise2 was recently restored;
  analyzing it again could confirm its integrity.
Planner -> get_suggestion_for_next_action:
  {"target_host": "Enterprise2",
   "situation": "restored, confirm integrity",
   "severity": "low", ...}
  ActionChooser (2 msgs): Returns:
    [Analyse (high), DecoySSHD (medium),
     Restore (low)]
Planner (final):
  Answer: Analyse hostname=Enterprise2
```

No deliberation tools are invoked\. The Planner identifies the most useful proactive action and executes it in a single delegation round\. The ActionChooser returns its recommendation in one exchange without a critique cycle\.

##### The critical divergence\.

The behavioral divergence begins at step 7. Both systems detect changes on Enterprise0 and User4. The hier-base Planner, at steps 3–8, follows a tight detect–analyse–restore cycle: upon finding C2 indicators (connection to port 4444, backdoor process), it immediately requests suggestions and acts on the high-confidence "Restore" recommendation. The hier-delib Planner, facing the same situation, invokes include_tool_raise_a_question ("What is the most strategic action?"), then include_tool_critique_the_answer ("Should I deploy a DecoySSHD or prioritize Enterprise0?"), and resolves toward prophylactic decoy deployment rather than reactive containment. By step 14, the hier-base system has executed 5 Restore and 3 Remove actions; the hier-delib system has executed 0 Restore, 0 Remove, and 8 Decoy deployments. The reward gap at step 14 is $-6.3$ vs. $-14.2$.

The gap becomes irreversible at step 15, when the red agent escalates to high-value targets. The hier-delib system, having deployed passive defenses rather than containing active compromise, faces a $-13.1$/step penalty that persists for the remaining 16 steps. It does not execute its first (and only) Restore until step 25.

## Appendix H CybORG CAGE-2 Environment Details

CybORG CAGE-2 (TTCP CAGE Challenge Working Group, [2022](https://arxiv.org/html/2605.16205#bib.bib4); Standen et al., [2021](https://arxiv.org/html/2605.16205#bib.bib16)) models an autonomous network-defense scenario used as Challenge 2 of the TTCP CAGE (Cyber Autonomy Gym for Experimentation) series. We summarize the key environment characteristics; full specifications and source code are available at [https://github.com/cage-challenge/cage-challenge-2](https://github.com/cage-challenge/cage-challenge-2).

##### Network topology\.

The simulated network comprises 13 hosts organized into three subnets: *User* (User0–User4, 5 hosts), *Enterprise* (Enterprise0–Enterprise2 plus a single Enterprise Server, 4 hosts), and *Operational* (Op_Host0–Op_Host2, Op_Server0, 4 hosts). Traffic flows from an Internet-facing subnet through User to Enterprise to Operational. Op_Server0 is the attacker’s ultimate high-value target.

##### Attacker \(Red agent\)\.

The red agent follows a scripted kill chain that progresses through five stages: *Discovery* (scanning for reachable hosts), *Access* (exploiting vulnerabilities to gain initial foothold), *Lateral Movement* (pivoting across subnets), *Privilege Escalation* (gaining root/admin access), and *Impact* (compromising high-value targets for maximum penalty). The attacker advances one stage per step and selects new targets based on reachability. The scripted policy is fixed across all experiments; no adaptive adversary is used.

##### Defender actions\.

The blue agent \(defender\) selects one action per step from a five\-type repertoire:

Monitor (free, observational): Observe the network. Returns the current observation with no side effects.

Analyse hostname=X (free, diagnostic): Deeper inspection of a host. Returns process- and connection-level details that may reveal compromise indicators not visible in the standard observation.

Remove hostname=X (attempted cleanup): Attempt to remove malicious artifacts from a host. May fail against persistent threats, leaving the host in an "unknown" state.

Restore hostname=X (guaranteed cleanup, high penalty): Reset the host to its known-good baseline. Guaranteed to succeed, but incurs a large one-step penalty and removes all deployed decoys from that host.

DecoySERVICE hostname=X (proactive): Deploy a decoy service (e.g., DecoySSHD, DecoyApache, DecoyTomcat) on a host to attract and detect attacker activity in future steps. Multiple decoy types can be stacked on a single host.
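The five-type repertoire above can be represented as a small data structure. The sketch below is an illustration only and is not the CybORG API: `ActionType`, `BlueAction`, and `render` are hypothetical names, and the rendered string simply mirrors the `Action hostname=X` form that appears in the trajectory excerpts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActionType(Enum):
    MONITOR = "Monitor"
    ANALYSE = "Analyse"
    REMOVE = "Remove"
    RESTORE = "Restore"
    DECOY = "Decoy"


@dataclass(frozen=True)
class BlueAction:
    kind: ActionType
    hostname: Optional[str] = None  # required for everything except Monitor
    service: Optional[str] = None   # decoy service suffix, e.g. "SSHD"

    def __post_init__(self):
        if self.kind is not ActionType.MONITOR and self.hostname is None:
            raise ValueError(f"{self.kind.value} requires a hostname")
        if self.kind is ActionType.DECOY and self.service is None:
            raise ValueError("Decoy requires a service name")

    def render(self) -> str:
        """Format the action the way it appears in the trace excerpts."""
        if self.kind is ActionType.MONITOR:
            return "Monitor"
        name = f"Decoy{self.service}" if self.kind is ActionType.DECOY else self.kind.value
        return f"{name} hostname={self.hostname}"


print(BlueAction(ActionType.ANALYSE, "Enterprise2").render())
print(BlueAction(ActionType.DECOY, "User4", "SSHD").render())
```

Validating at construction time keeps malformed actions (for example, a Restore with no target host) from reaching the environment.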

##### Reward structure\.

The per-step reward $r_t \leq 0$ penalizes each compromised host according to its value tier; Operational hosts carry the highest penalties. Restore incurs an additional fixed penalty at the step it is applied. The maximum cumulative return (perfect defense) is $0$; all configurations operate in the negative-return regime.
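The shape of this reward can be sketched as a sum of tiered penalties plus a Restore surcharge. The penalty magnitudes below are placeholders chosen only so that Operational hosts cost the most; they are not the values used by CybORG CAGE-2, which should be taken from the environment source. The names `HOST_PENALTY`, `RESTORE_PENALTY`, and `step_reward` are hypothetical.

```python
# Placeholder penalties per compromised host, keyed by subnet tier.
# These magnitudes are illustrative assumptions, not CAGE-2 values.
HOST_PENALTY = {"User": -0.1, "Enterprise": -1.0, "Operational": -10.0}
RESTORE_PENALTY = -1.0  # placeholder fixed penalty for a Restore action


def step_reward(compromised_hosts, restored_this_step=False):
    """Per-step reward r_t <= 0.

    `compromised_hosts` maps each compromised hostname to its subnet tier.
    """
    reward = sum(HOST_PENALTY[tier] for tier in compromised_hosts.values())
    if restored_this_step:
        reward += RESTORE_PENALTY
    return reward


def episode_return(step_rewards):
    """Cumulative return G; a perfect defense yields 0."""
    return sum(step_rewards)


print(step_reward({}))  # clean network, no Restore
print(step_reward({"User4": "User", "Op_Server0": "Operational"}, True))
```

Under any penalties of this shape, a Restore is worthwhile whenever its one-step cost is smaller than the accumulated per-step penalty it prevents, which is why early intervention matters.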

##### Episode structure\.

Each episode runs for $T = 30$ steps. The attacker begins with access to the Internet-facing subnet and progresses inward. The defender observes partial, noisy information and must infer compromise from limited signals. Early intervention is rewarded because attacker progress compounds: a compromised User host enables lateral movement to Enterprise, then to Operational targets where penalties are highest.

## Appendix I Episode Counts

Table [34](https://arxiv.org/html/2605.16205#A9.T34) reports the number of evaluated episodes for each of the 72 model–configuration pairs. The standard allocation is 10 instances $\times$ 5 runs $= 50$ episodes per pair. G3FP uses a reduced default of $5 \times 5 = 25$ episodes per configuration due to staged data collection. Several configurations include extended batches (marked with †) to reduce uncertainty on key comparisons: G2.5FL obs+net (100 episodes), Qwen hier-base (100 episodes), and G2.5FL hier-base (75 episodes). The total across all models and configurations is 3,475 episodes (104,250 agent–environment interaction steps, consuming 283.9M tokens).

Table 34. Episode counts per model–configuration pair. Standard allocation is 10 instances $\times$ 5 runs = 50 episodes per configuration. †Configuration includes additional evaluation batches beyond the standard allocation.
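The aggregate figures quoted in this appendix can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (3,475 episodes, $T = 30$ steps per episode, 283.9M tokens); the derived per-episode and per-step token averages are not reported in the paper and are shown only as a rough scale reference. Variable names are hypothetical.

```python
STEPS_PER_EPISODE = 30        # T = 30, from Appendix H
TOTAL_EPISODES = 3_475        # stated total across all pairs
TOTAL_TOKENS = 283.9e6        # stated total token consumption

standard_allocation = 10 * 5  # instances x runs = 50 episodes per pair
g3fp_allocation = 5 * 5       # reduced default for G3FP = 25 episodes

total_steps = TOTAL_EPISODES * STEPS_PER_EPISODE
tokens_per_episode = TOTAL_TOKENS / TOTAL_EPISODES  # about 81.7K
tokens_per_step = TOTAL_TOKENS / total_steps        # about 2.7K

print(f"total steps: {total_steps:,}")
print(f"mean tokens per episode: {tokens_per_episode:,.0f}")
print(f"mean tokens per step: {tokens_per_step:,.0f}")
```

The computed step count matches the 104,250 agent–environment interaction steps stated above.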