Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation
Summary
This paper introduces CEO-Bench, a multi-agent benchmark for evaluating LLMs on CEO-level strategic resource reallocation, revealing systematic failure modes and a structural integration–boldness tradeoff.
# Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation
Source: [https://arxiv.org/html/2606.17459](https://arxiv.org/html/2606.17459)
###### Abstract
Evaluating the decision\-making capabilities of large language models \(LLMs\) is a growing research priority, yet existing benchmarks focus on isolated cognitive tasks such as reasoning, knowledge retrieval, and economic rationality in stylized settings\. These evaluations overlook the defining challenge of real executive decision\-making: integrating conflicting recommendations from specialized stakeholders under information asymmetry, organizational constraints, and temporal dependencies\. We introduce CEO\-Bench, a multi\-agent benchmark that evaluates LLMs on CEO\-level strategic resource reallocation, the process of redirecting capital across business units in a multi\-round, constraint\-rich organizational environment\. In CEO\-Bench, LLM agents receive conflicting advice from four role\-conditioned C\-suite advisors \(CFO, CTO, COO, CMO\), each with private signals and distinct priorities, and must synthesize these into a concrete allocation plan evaluated along four dimensions: role integration, conditional boldness, history\-sensitive judgment, and plan validity\. Experiments across five frontier models on 13 scenarios reveal that all models achieve high structural validity but diverge sharply on strategic calibration, the hardest capability layer\. We identify systematic failure modes including single\-advisor capture, conservative default under ambiguity, and historical amnesia, and uncover a structural integration–boldness tradeoff: models that engage more deeply with conflicting perspectives tend to produce less decisive action\. These findings delineate the current capability boundary of LLMs as organizational decision\-makers and inform the design of future AI\-assisted executive systems\.
Yuyang Dai¹, Xueqing Peng², Lingfei Qian², and Zhuohan Xie¹ \(¹MBZUAI, ²Yale University\)
## 1 Introduction
Large language models \(LLMs\) have rapidly expanded the frontier of artificial intelligence, demonstrating increasingly strong capabilities in reasoning, strategic planning, and autonomous agent behavior \(Park et al\., [2023](https://arxiv.org/html/2606.17459#bib.bib16); Yang et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib23)\)\. Recent work explores LLMs as strategic reasoners in game\-theoretic environments \(Lorè and Heydari, [2023](https://arxiv.org/html/2606.17459#bib.bib25); Zhang et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib24); Wang et al\., [2026](https://arxiv.org/html/2606.17459#bib.bib26)\), economic decision\-makers under uncertainty \(Raman et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib21); Fish et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib31)\), and participants in multi\-agent systems that collaborate, debate, and negotiate toward collective outcomes \(Guo et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib30); Sun et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib20)\)\. However, existing evaluations largely focus on isolated reasoning tasks or stylized interactions, leaving open a more difficult question: can LLM agents handle executive\-level strategic decisions that require integrating conflicting stakeholder interests, incomplete information, and dynamic organizational constraints?
We argue that CEO\-level resource reallocation provides an ideal testbed for stress\-testing this boundary\. Such reallocation, which redirects capital, talent, and organizational attention across business units, is one of the highest\-leverage decisions in corporate strategy\. McKinsey’s longitudinal analysis of 1,600 firms shows that companies reallocating more than 50% of capital expenditure across business units over a decade create substantially more value than less active counterparts \(Atsmon, [2016](https://arxiv.org/html/2606.17459#bib.bib15); Hall et al\., [2012](https://arxiv.org/html/2606.17459#bib.bib32)\)\. Critically, the difficulty of these decisions is not purely analytical but organizational\. CEOs must integrate conflicting recommendations from functional executives: the CFO emphasizing financial risk, the CTO advocating for R&D investment, the COO prioritizing operational continuity, and the CMO focusing on market opportunity, each operating with asymmetric information and divergent incentives \(Eisenhardt and Zbaracki, [1992](https://arxiv.org/html/2606.17459#bib.bib6); Stein, [1997](https://arxiv.org/html/2606.17459#bib.bib12)\)\. Mintzberg and others \([1975](https://arxiv.org/html/2606.17459#bib.bib1)\) characterized managerial work as the synthesis of fragmented information streams into coherent action, and subsequent research argues that this integrative judgment under cross\-functional conflict is a defining feature of effective executive decision\-making \(Barnard, [1968](https://arxiv.org/html/2606.17459#bib.bib4); March and Simon, [1993](https://arxiv.org/html/2606.17459#bib.bib5); Burgelman, [1983](https://arxiv.org/html/2606.17459#bib.bib7)\)\.
Despite rapid progress in LLM evaluation, no existing benchmark systematically tests this class of organizational decision\-making capability\. Current paradigms primarily evaluate isolated skills such as factual knowledge, mathematical reasoning, code generation, or economic rationality in stylized settings \(Raman et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib21); Fish et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib31)\)\. Role\-playing benchmarks assess whether LLMs maintain character consistency in dialogue \(Wang et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib17); Yuan et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib18); Wang et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib19)\), but not whether role\-conditioned agents produce functionally differentiated advice that a decision\-maker must reconcile\. Similarly, multi\-agent research has explored debate, social simulation, and cooperative problem\-solving \(Park et al\., [2023](https://arxiv.org/html/2606.17459#bib.bib16); Guo et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib30)\), yet these settings generally lack the hierarchical authority structure, asymmetric information, and temporal dependencies that define real organizational decision\-making\. While recent work has begun exploring LLMs in strategic environments \(Allen and McDonald, [2026](https://arxiv.org/html/2606.17459#bib.bib22); Stoeber et al\., [2026](https://arxiv.org/html/2606.17459#bib.bib28)\), the question of whether LLM agents can perform integrative executive judgment under cross\-functional conflict remains open\.
This paper addresses this gap by introducing CEO\-Bench, a multi\-agent benchmark for evaluating LLMs on CEO\-level resource reallocation decisions under cross\-functional conflict and asymmetric information\. In CEO\-Bench, LLM agents assume the roles of a CEO and four C\-suite executives \(CFO, CTO, COO, and CMO\), each with distinct priorities and role\-specific information\. The CEO\-agent must integrate these conflicting recommendations into resource allocation decisions across multiple rounds with evolving organizational and market conditions\.
Our contributions are threefold\.
❍ We introduce CEO\-Bench, the first benchmark for evaluating integrative executive decision\-making in multi\-role organizational environments\.
❍ We propose a multi\-dimensional evaluation framework measuring role integration, strategic boldness calibration, and temporal coherence beyond single\-score accuracy metrics\.
❍ We conduct experiments across frontier LLMs and characterize key failure modes of organizational decision\-making agents, including excessive consensus\-seeking, context\-insensitive boldness, and historical inconsistency\.
## 2 Related Work
Table 1: Comparison with prior work\. ✓ = fully addressed; ❍ = partial; ✗ = not addressed\. MR: multi\-role functional differentiation with asymmetric information\. HI: hierarchical integration under conflict\. BC: boldness calibration to context\. TR: multi\-round temporal coherence\. OG: grounded in organizational decision\-making theory\.
### 2\.1 Executive Decision\-Making and Resource Allocation
Executive decision\-making has long been studied as a problem of integrating fragmented and often conflicting information under uncertainty\. Organization theory characterizes CEOs as coordinators of informational, interpersonal, and decisional processes rather than purely analytical optimizers \(Barnard, [1968](https://arxiv.org/html/2606.17459#bib.bib4); Mintzberg and others, [1975](https://arxiv.org/html/2606.17459#bib.bib1); March and Simon, [1993](https://arxiv.org/html/2606.17459#bib.bib5)\)\. More recent work identifies bold and timely resource reallocation as one of the strongest predictors of long\-term firm performance \(Dewar et al\., [2022](https://arxiv.org/html/2606.17459#bib.bib29); Hall et al\., [2012](https://arxiv.org/html/2606.17459#bib.bib32); Atsmon, [2016](https://arxiv.org/html/2606.17459#bib.bib15)\)\. Complementary research in finance further shows that internal capital allocation occurs under asymmetric information, competing incentives, and organizational frictions such as rent\-seeking and agency problems \(Stein, [1997](https://arxiv.org/html/2606.17459#bib.bib12); Rajan et al\., [2000](https://arxiv.org/html/2606.17459#bib.bib13); Ozbas and Scharfstein, [2010](https://arxiv.org/html/2606.17459#bib.bib14)\)\. Taken together, this literature suggests that CEO\-level resource reallocation requires two capabilities largely absent from existing LLM evaluations: integrating conflicting recommendations from specialized advisors with private information, and calibrating strategic boldness under evolving organizational conditions over time \(Eisenhardt and Zbaracki, [1992](https://arxiv.org/html/2606.17459#bib.bib6); Boudreaux and Holcombe, [1989](https://arxiv.org/html/2606.17459#bib.bib8)\)\.
### 2\.2 LLM\-Based Multi\-Agent Systems and Role\-Playing
Recent work has increasingly explored LLMs as autonomous and multi\-agent systems\. Prior studies show that inter\-agent interaction, debate, and self\-reflection can improve reasoning and coordination quality over single\-agent baselines \(Park et al\., [2023](https://arxiv.org/html/2606.17459#bib.bib16); Du et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib35)\)\. A parallel line of work investigates LLM role\-playing, evaluating whether agents can maintain consistent personas, values, and dialogue behavior across multi\-turn interactions \(Wang et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib17); Yuan et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib18); Wang et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib19)\)\. However, these benchmarks primarily evaluate persona consistency rather than *functional differentiation* and *integrative decision quality*, for example, whether role\-conditioned agents produce substantively different recommendations and whether a higher\-level agent can effectively reconcile them\. Existing surveys of LLM\-based multi\-agent systems similarly identify coordination, evaluation, and temporal coherence as open challenges, while noting the scarcity of benchmarks involving hierarchical authority, asymmetric information, and sustained organizational decision\-making \(Guo et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib30); Sun et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib20)\)\.
### 2\.3 Strategic Reasoning and Economic Decision\-Making Benchmarks
A growing literature evaluates LLMs on strategic and economic decision\-making beyond factual recall and logical reasoning\. Prior work studies LLM behavior in game\-theoretic environments, negotiation, sequential planning, and economic rationality, revealing both emerging strategic capabilities and systematic failures such as excessive cooperation, inconsistent risk attitudes, and weak opponent modeling \(Lorè and Heydari, [2023](https://arxiv.org/html/2606.17459#bib.bib25); Zhang et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib24); Wang et al\., [2026](https://arxiv.org/html/2606.17459#bib.bib26); Zhang et al\., [2026](https://arxiv.org/html/2606.17459#bib.bib27); Raman et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib21); Fish et al\., [2025](https://arxiv.org/html/2606.17459#bib.bib31)\)\. Closest to our work, recent strategy\-oriented benchmarks show that frontier models can generate coherent business strategies but struggle with dynamic adaptation, competitive response, and differentiated strategic positioning \(Allen and McDonald, [2026](https://arxiv.org/html/2606.17459#bib.bib22); Stoeber et al\., [2026](https://arxiv.org/html/2606.17459#bib.bib28)\)\. However, these evaluations remain largely single\-agent or low\-dimensional, lacking the multi\-role hierarchy, asymmetric information, and integrative organizational judgment required in real executive decision\-making\.
## 3 CEO\-Bench: Benchmark Design
This section describes the design of CEO\-Bench, a benchmark for evaluating LLM agents on CEO\-level resource reallocation decisions\. We present the task definition \(§[3\.1](https://arxiv.org/html/2606.17459#S3.SS1)\), the role\-conditioned agent architecture \(§[3\.2](https://arxiv.org/html/2606.17459#S3.SS2)\), the scenario construction \(§[3\.3](https://arxiv.org/html/2606.17459#S3.SS3)\), and the evaluation metrics \(§[3\.4](https://arxiv.org/html/2606.17459#S3.SS4)\)\.
### 3\.1 Task Definition
CEO\-Bench centers on a resource reallocation decision task in a multi\-business\-unit firm\. In each scenario, the agent assumes the role of a CEO overseeing a company with several business units that differ in growth prospects, execution risks, and strategic roles\. The CEO’s objective is to produce a cross\-unit reallocation plan that shifts capital away from some units and toward others, subject to organizational and financial constraints\. Each benchmark instance is uniquely identified by a \(*company\_id*, *target\_date*\) pair\. At each decision step, the CEO receives structured information comprising: \(i\) company\-level financial and strategic state, \(ii\) business\-unit\-level performance and absorptive capacity, \(iii\) reallocation constraints, \(iv\) role\-conditioned recommendations from four C\-suite advisors, and \(v\) historical allocation decisions up to the current date\. The CEO outputs a structured allocation plan, represented as a redistribution of resource shares across business units, which includes:
❑ a set of units from which resources are removed \(sources\),
❑ a set of units to which resources are added \(destinations\),
❑ a total reallocation share \(the magnitude of the shift\),
❑ a decision type label \(e\.g\., conservative, moderate, or bold\), and
❑ a rationale explaining how conflicting executive recommendations were integrated\.
Critically, the action space is continuous rather than discrete: the CEO does not select from predefined options but must construct a strategically appropriate allocation plan under organizational and financial constraints\. The benchmark is further structured as a multi\-round decision problem in which prior allocations affect future organizational states, enabling evaluation of both single\-step decision quality and history\-sensitive strategic consistency over time\.
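The plan structure listed above can be sketched as a small data type. This is an illustrative sketch only: the field names, the use of fractional shares, and the choice to derive the total reallocation share from the sources are assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AllocationPlan:
    """Hypothetical container for a CEO allocation plan (names are assumed)."""
    sources: Dict[str, float]       # unit -> share of total capital removed
    destinations: Dict[str, float]  # unit -> share of total capital added
    decision_type: str              # e.g. "conservative", "moderate", "bold"
    rationale: str                  # how advisor conflicts were integrated

    @property
    def total_reallocation_share(self) -> float:
        # Assumption: the magnitude of the shift is the total share removed.
        return sum(self.sources.values())

    def apply(self, current: Dict[str, float]) -> Dict[str, float]:
        """Return the allocation that results from applying the plan."""
        updated = dict(current)
        for unit, share in self.sources.items():
            updated[unit] = updated.get(unit, 0.0) - share
        for unit, share in self.destinations.items():
            updated[unit] = updated.get(unit, 0.0) + share
        return updated


plan = AllocationPlan(
    sources={"A": 0.10},
    destinations={"C": 0.06, "D": 0.04},
    decision_type="moderate",
    rationale="Shift capital from the legacy unit toward growth units in stages.",
)
new_allocation = plan.apply({"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10})
```

Because shares are continuous, any pair of source and destination maps with matching totals is a candidate plan, which is what distinguishes this action space from a choice among predefined options.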
### 3\.2 Agent Architecture
CEO\-Bench adopts a role\-conditioned multi\-view decision architecture consisting of five executive roles: CEO, CFO, CTO, COO, and CMO \(Table [2](https://arxiv.org/html/2606.17459#S3.T2)\)\. The four non\-CEO roles serve as functional advisors with distinct priorities, constraints, and private information\. This design reflects the core managerial challenge identified in organization theory: executive judgment requires integration under competing functional logics rather than isolated analytical optimization \(Mintzberg and others, [1975](https://arxiv.org/html/2606.17459#bib.bib1)\)\.
Table 2: Role specifications in CEO\-Bench\.
##### Information asymmetry\.
All roles share the same company scenario but receive different private signals and role\-specific constraints\. For example, the CFO may observe tighter financing constraints, the CTO infrastructure bottlenecks, the COO operational fragility, and the CMO narrowing demand windows\. These asymmetries force the CEO to reconcile recommendations that are locally rational but globally incompatible \(Eisenhardt and Zbaracki, [1992](https://arxiv.org/html/2606.17459#bib.bib6); Stein, [1997](https://arxiv.org/html/2606.17459#bib.bib12)\)\.
##### Advisor output format\.
Each advisor produces a structured recommendation consisting of \(i\) an allocation preference, \(ii\) a short rationale, \(iii\) a primary risk assessment, and \(iv\) an opposition condition under which the role would resist the proposed allocation\. The CEO then integrates these recommendations into a final plan\.
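As a rough illustration of the four-part advisor output, the sketch below models a recommendation and finds the units on which advisors pull in opposite directions. The field names, the encoding of an allocation preference as per-unit share changes, and the helper `contested_units` are all hypothetical and not taken from the benchmark.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AdvisorRecommendation:
    """Hypothetical record for one advisor's structured output."""
    role: str                                # "CFO", "CTO", "COO", or "CMO"
    allocation_preference: Dict[str, float]  # assumed: unit -> preferred share change
    rationale: str
    primary_risk: str
    opposition_condition: str


def contested_units(recs: List[AdvisorRecommendation]) -> List[str]:
    """Units where at least one advisor wants more and another wants less."""
    units = sorted({u for r in recs for u in r.allocation_preference})
    contested = []
    for unit in units:
        deltas = [r.allocation_preference[unit]
                  for r in recs if unit in r.allocation_preference]
        if any(d > 0 for d in deltas) and any(d < 0 for d in deltas):
            contested.append(unit)
    return contested


recs = [
    AdvisorRecommendation("CFO", {"C": -0.02}, "Preserve cash.",
                          "Covenant review", "Any shift that raises leverage"),
    AdvisorRecommendation("CTO", {"C": 0.06}, "Fund the platform within capacity.",
                          "Infrastructure bottleneck", "Shifts above absorptive capacity"),
    AdvisorRecommendation("CMO", {"C": 0.20, "A": -0.20}, "Capture the demand window.",
                          "Closing demand window", "Plans that delay growth investment"),
]
conflicts = contested_units(recs)
```

A list like `conflicts` is one simple way to surface where the CEO must make an explicit trade-off rather than average the advice.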
##### Implementation\.
The current implementation uses a single\-agent CEO baseline: one LLM simulates the advisor perspectives and performs CEO\-level synthesis in a single reasoning pass\. This isolates integrative decision\-making from inter\-agent communication effects and provides a baseline for future multi\-agent extensions\.
### 3\.3 Scenario Design
The benchmark scenarios are designed to capture the organizational properties that make CEO resource allocation difficult, including cross\-functional conflict, information asymmetry, dynamic constraints, and path\-dependent outcomes\.
##### Company structure\.
Each company contains multiple business units with distinct strategic roles, such as legacy cash\-generating units, customer\-retention units, growth units, and productivity\-oriented units\. Units differ in allocation share, ROI trend, growth outlook, execution risk, absorptive capacity, and strategic importance\.
##### Organizational state\.
Each round additionally specifies a company\-level organizational state, including cash runway, leverage ratio, revenue growth, margin profile, transformation pressure, capacity constraints, and strategic priorities\. This structure requires the CEO to balance unit\-level opportunities against company\-level feasibility constraints\.
##### Multi\-round dynamics\.
Scenarios evolve across multiple decision rounds\. Organizational states change as a function of prior allocations and external environment shifts, allowing the benchmark to capture path\-dependent effects such as delayed transformation pressure, absorptive\-capacity overload, operational instability from sustained underinvestment, and accumulating financial fragility\.
##### Difficulty tiers\.
To vary organizational complexity, scenarios are divided into four tiers \(Table [3](https://arxiv.org/html/2606.17459#S3.T3)\)\. These tiers enable analysis of how model performance changes as organizational complexity increases\.
Table 3: Scenario difficulty tiers in CEO\-Bench\.
##### Multiple acceptable profiles\.
The benchmark avoids prescribing a single hidden “correct answer” for most scenarios\. Instead, many scenarios admit multiple acceptable strategic profiles, such as aggressive growth, sequenced rebalancing, or compromise strategies balancing growth and stability\. Each profile specifies acceptable allocation ranges, preferred source and destination units, and profile\-specific constraint tolerances\. This design evaluates strategic judgment and coherence rather than exact answer matching\.
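One way to read the multiple-profile design is as a set-membership check rather than an exact-match check. The sketch below is a minimal illustration under assumed names and made-up ranges; `StrategicProfile`, its fields, and the example numbers are hypothetical, and profile-specific constraint tolerances are omitted for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class StrategicProfile:
    """Hypothetical acceptable profile (ranges and unit sets are illustrative)."""
    name: str
    share_range: Tuple[float, float]      # acceptable total reallocation share
    preferred_sources: FrozenSet[str]
    preferred_destinations: FrozenSet[str]


def matching_profiles(total_share: float,
                      sources: Iterable[str],
                      destinations: Iterable[str],
                      profiles: List[StrategicProfile]) -> List[str]:
    """Names of every profile the plan is compatible with (possibly several)."""
    matches = []
    for p in profiles:
        low, high = p.share_range
        in_range = low <= total_share <= high
        sources_ok = set(sources) <= p.preferred_sources
        destinations_ok = set(destinations) <= p.preferred_destinations
        if in_range and sources_ok and destinations_ok:
            matches.append(p.name)
    return matches


profiles = [
    StrategicProfile("aggressive_growth", (0.15, 0.25),
                     frozenset({"A"}), frozenset({"C"})),
    StrategicProfile("sequenced_rebalancing", (0.08, 0.15),
                     frozenset({"A", "B"}), frozenset({"C", "D"})),
]
moderate = matching_profiles(0.12, ["A"], ["C"], profiles)
timid = matching_profiles(0.05, ["A"], ["C"], profiles)
```

A plan that matches any profile is acceptable under this reading, so two quite different plans can both score well.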
### 3\.4 Evaluation Metrics
CEO\-Bench evaluates models along four dimensions corresponding to the benchmark’s core research questions\.
#### 3\.4\.1 Role Integration
Role integration measures whether the CEO synthesizes conflicting executive perspectives rather than collapsing into a single functional viewpoint\. The evaluator checks whether the allocation plan reflects trade\-offs across advisor recommendations, acknowledges key tensions, and avoids ignoring critical operational, financial, or strategic concerns\.
#### 3\.4\.2 Conditional Boldness
Conditional boldness measures whether the aggressiveness of reallocation is calibrated to the organizational state rather than defaulting to uniformly conservative or aggressive behavior\. The evaluator considers total reallocation magnitude, scenario\-specific boldness thresholds, and alignment with acceptable strategic profiles such as growth pivots, sequenced rebalancing, or stability\-oriented strategies \(Dewar et al\., [2022](https://arxiv.org/html/2606.17459#bib.bib29); Atsmon, [2016](https://arxiv.org/html/2606.17459#bib.bib15)\)\.
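A threshold-based reading of this metric can be sketched as follows. The band endpoints are made-up scenario parameters, the function name is hypothetical, and the label `too_bold_for_state` is an assumed counterpart to the under-boldness label; the actual evaluator also weighs profile alignment, which this sketch ignores.

```python
def boldness_label(total_share: float, low: float, high: float) -> str:
    """Classify a plan's magnitude against an assumed scenario-specific band."""
    if total_share < low:
        return "not_bold_enough_for_state"
    if total_share > high:
        return "too_bold_for_state"  # hypothetical label
    return "calibrated"              # hypothetical label


# Illustrative band for a scenario that can support a 10-16% shift.
labels = [boldness_label(s, low=0.10, high=0.16) for s in (0.06, 0.13, 0.21)]
```

The point of the band is that the same 13% shift could be calibrated in one scenario and too timid or too aggressive in another, depending on the organizational state.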
#### 3\.4\.3 History\-Sensitive Judgment
History\-sensitive judgment measures whether the CEO incorporates prior organizational trajectory when making decisions across rounds\. The evaluator penalizes short\-sighted behaviors such as repeatedly overfunding overloaded units or persistently underinvesting in protected units requiring stable support\. This distinguishes static snapshot optimization from sequential organizational reasoning \(Zhang et al\., [2026](https://arxiv.org/html/2606.17459#bib.bib27)\)\.
#### 3\.4\.4 Plan Validity and Strategic Fit
Plan validity measures whether proposed reallocations satisfy hard organizational constraints, including mass balance, allocation caps, unit\-level constraints, locked capital restrictions, and scenario\-specific transfer rules\. Beyond feasibility, strategic fit evaluates alignment between the proposed allocation and acceptable strategic profiles, including appropriate destination units, protected\-unit preservation, and avoidance of overload risk\. Scores are aggregated through a deterministic rule\-based evaluator that produces overall scores, letter grades, and structured failure\-mode labels\. This design ensures reproducibility without relying on LLM\-as\-judge evaluation \(Raman et al\., [2024](https://arxiv.org/html/2606.17459#bib.bib21)\)\.
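The hard-constraint portion of this check lends itself to a deterministic rule-based function. The sketch below covers mass balance, allocation caps, floors, and locked capital only; the function signature, the violation strings, and the example numbers are assumptions for illustration, not the benchmark's evaluator.

```python
from typing import Dict, List, Set


def validate_plan(current: Dict[str, float],
                  sources: Dict[str, float],
                  destinations: Dict[str, float],
                  caps: Dict[str, float],
                  floors: Dict[str, float],
                  locked: Set[str],
                  tol: float = 1e-9) -> List[str]:
    """Return a list of violated hard constraints (empty means valid)."""
    violations = []
    # Mass balance: every share removed must be added somewhere.
    if abs(sum(sources.values()) - sum(destinations.values())) > tol:
        violations.append("mass_balance")
    for unit, share in sources.items():
        # Locked capital cannot be used as a source.
        if unit in locked:
            violations.append(f"locked_capital:{unit}")
        # Floors protect units that need stable support.
        if current[unit] - share < floors.get(unit, 0.0) - tol:
            violations.append(f"floor:{unit}")
    for unit, share in destinations.items():
        # Caps limit how much a unit may hold after the shift.
        if current[unit] + share > caps.get(unit, 1.0) + tol:
            violations.append(f"cap:{unit}")
    return violations


current = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
caps, floors, locked = {"C": 0.26}, {"B": 0.25}, {"D"}

ok = validate_plan(current, {"A": 0.06}, {"C": 0.06}, caps, floors, locked)
bad = validate_plan(current, {"B": 0.10}, {"C": 0.08}, caps, floors, locked)
```

Because the function is pure and rule-based, the same plan always yields the same violation list, which is the reproducibility property the text attributes to the evaluator.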
## 4 Experiments
### 4\.1 Experimental Setup
We evaluate five large language models spanning open\-weight and API\-hosted families: Gemma\-4\-27B\-A4B\-IT, GPT\-OSS\-20B, Qwen\-Plus\-2025\-07\-28, Claude\-3\.5\-Haiku, and NVIDIA Nemotron\-Nano\-9B\-v2\. Each model is tested on the same 13 CEO\-Bench scenarios\. All models receive the same structured prompt template, which includes company state, business\-unit state, reallocation constraints, benchmark context, decision history, and four role\-conditioned advisor views \(CFO, CTO, COO, and CMO\)\. Generation uses a low\-temperature configuration \(τ = 0\.2\) to reduce variance\. In the current evaluation stage, each model is run once per scenario to establish baseline performance; future work will incorporate repeated runs to quantify stochastic variation\.
### 4\.2 Overall Results
Table [4](https://arxiv.org/html/2606.17459#S4.T4) reports the main benchmark results across the four evaluation dimensions defined in §[3\.4](https://arxiv.org/html/2606.17459#S3.SS4)\. Gemma\-4\-27B\-A4B\-IT achieves the strongest overall performance with an average score of 73\.86, followed by GPT\-OSS\-20B at 71\.90\. Qwen\-Plus, Claude\-3\.5\-Haiku, and Nemotron\-Nano\-9B form a second tier with noticeably lower overall scores and more frequent invalid or weakly calibrated plans\. Gemma and GPT\-OSS\-20B also produce the highest number of valid plans \(13/13\), suggesting that strong overall performance depends not only on strategic fit but also on consistent feasibility under hard constraints\.
Table 4: Overall performance across evaluation dimensions\. Scores are averaged across all 13 scenarios\. Valid Plans reports the number of scenarios \(out of 13\) in which the model produced a structurally valid reallocation plan\.
Difficulty\-tier breakdowns \(Table [5](https://arxiv.org/html/2606.17459#S4.T5)\) reveal sharper separation among models\. Gemma remains comparatively robust in easy and fragile settings, while GPT\-OSS\-20B is particularly competitive on adversarial scenarios\. Claude\-3\.5\-Haiku and Nemotron\-Nano\-9B show larger instability under high\-conflict conditions, although the specific weakness differs: Claude more often collapses on adversarial cases, whereas Nemotron suffers from lower average calibration across easy and fragile settings\. Notably, Qwen\-Plus and Nemotron achieve their highest scores on adversarial scenarios, suggesting that these models may benefit from the stronger signal\-to\-noise ratio in extreme conflict settings, even though their overall calibration remains weaker\.
Table 5: Average score by scenario difficulty tier\. Higher scores indicate better performance within each tier\.
Figure [1](https://arxiv.org/html/2606.17459#S4.F1) provides a visual summary of the main results\. Panel \(a\) shows overall scores, confirming the two\-tier structure\. Panel \(b\) decomposes performance by evaluation dimension, revealing that plan validity is consistently the strongest dimension across all models, while boldness calibration is the weakest, indicating that generating feasible plans is substantially easier than calibrating their strategic aggressiveness\. Panel \(c\) shows performance by difficulty tier, and Panel \(d\) displays the distribution of outcome grades across models\.
Figure 1: Summary of CEO\-Bench results across five models\. \(a\) Overall scores\. \(b\) Scores decomposed by evaluation dimension: role integration, boldness calibration, history sensitivity, and plan validity\. \(c\) Average scores by difficulty tier \(easy, fragile, adversarial\)\. \(d\) Distribution of outcome grades per model: invalid plan, misallocated, acceptable with trade\-offs, and high quality\.
### 4\.3 Role Integration Analysis
Role integration evaluates whether the CEO meaningfully reconciles conflicting functional advice instead of implicitly following a single advisor\. Qualitatively, the strongest models produce final rationales that explicitly trade off financial resilience \(CFO\), technical absorptive capacity \(CTO\), operational continuity \(COO\), and market timing \(CMO\)\. GPT\-OSS\-20B performs best on this dimension \(81\), followed closely by Gemma \(78\), indicating that these models are more likely to articulate why one advisor’s concern should dominate in a particular setting without completely ignoring the others\.
By contrast, weaker models exhibit one of two characteristic failure patterns\. First, some models collapse into a*dominant\-function decision*, typically over\-weighting either growth logic \(following the CMO\) or risk control \(following the CFO\), while treating other perspectives as secondary commentary\. Second, some models produce*consensus summaries*—they acknowledge multiple advisors in their rationale but fail to convert those views into a coherent allocation trade\-off\. In both cases, the resulting plan often appears plausible on the surface but performs poorly under post\-hoc strategic evaluation because it does not reflect genuine integrative judgment\.
### 4\.4 Conditional Boldness Analysis
Conditional boldness is a central challenge in CEO\-Bench\. The benchmark rewards bold reallocations only when the company’s financial and organizational state can support them; conversely, it penalizes aggressive plans when capacity constraints or financial fragility make large shifts risky\. Across models, the most common systematic failure is *under\-reallocation*: many plans are valid but not aggressive enough for the scenario\. This pattern is especially pronounced for Nemotron\-Nano\-9B \(58\) and Qwen\-Plus \(61\), both of which score materially lower on boldness calibration than on role integration, suggesting that these models can identify the relevant trade\-offs but default to conservative execution\.
Gemma \(70\) and GPT\-OSS\-20B \(68\) perform best on this dimension, indicating a better ability to distinguish between scenarios that call for a strong portfolio shift and those that require staged rebalancing\. Nevertheless, even these stronger models remain imperfect: both still exhibit conservative drift on a non\-trivial subset of fragile and transformation\-heavy scenarios\. This finding is consistent with the observation of Stoeber et al\. \([2026](https://arxiv.org/html/2606.17459#bib.bib28)\) that LLMs tend toward strategic conformity, and suggests that the bias toward caution may be a general property of instruction\-tuned models rather than a model\-specific weakness\.
### 4\.5 History\-Sensitive Judgment
CEO\-Bench is designed so that historical allocation trajectories matter: repeated overfunding can overload destination units beyond their absorptive capacity, and repeated underfunding can destabilize protected units that require stable investment\. On this dimension, GPT\-OSS\-20B achieves the strongest score \(74\), followed by Gemma \(72\), suggesting that these models are more consistent in producing historically coherent plans\. In qualitative inspection, their rationales are more likely to acknowledge prior reallocation patterns, reference earlier capacity buildups, and reason about transformation sequencing across rounds\.
The weaker models appear more myopic\. Nemotron\-Nano\-9B \(62\) and Claude\-3\.5\-Haiku \(65\) more often behave as if each round were a fresh optimization problem, producing plans that fit the current snapshot but are inconsistent with the organization’s earlier allocation path\. This pattern is reflected in lower history\-sensitive scores and a higher incidence of*repeated strategic overcorrection*—alternating between aggressive funding and defunding of the same unit across consecutive rounds\. This behavior echoes the “historical amnesia” failure mode, where the model treats each decision context as independent despite being given full access to historical state\.
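The overcorrection pattern described above, in which a unit is funded in one round and defunded in the next, can be detected with a simple sign-flip check over the allocation history. This is a sketch under assumptions: the per-round encoding as unit-to-share-change maps, the `min_swing` threshold of 3 percentage points, and the function name are all hypothetical.

```python
from typing import Dict, List


def overcorrected_units(history: List[Dict[str, float]],
                        min_swing: float = 0.03) -> List[str]:
    """Units whose share change flips sign between consecutive rounds.

    history[i] maps unit -> change in share decided in round i
    (positive = funded, negative = defunded). Small flips below
    min_swing on either side are ignored as noise.
    """
    flagged = set()
    for prev, curr in zip(history, history[1:]):
        for unit in prev.keys() & curr.keys():
            a, b = prev[unit], curr[unit]
            if a * b < 0 and min(abs(a), abs(b)) >= min_swing:
                flagged.add(unit)
    return sorted(flagged)


history = [
    {"E": 0.05, "F": -0.05},   # round 1: fund E from F
    {"E": -0.06, "F": 0.06},   # round 2: reverse the shift
    {"E": 0.01, "F": -0.01},   # round 3: minor adjustment, below threshold
]
flagged = overcorrected_units(history)
```

A single reversal is not always an error, since an exogenous shock can justify one; a check like this is better read as a flag for inspection than as a verdict.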
### 4\.6 Failure Mode Analysis
Failure mode analysis reveals that the performance gap across models is not driven by a single weakness but by distinct failure profiles\. Figure [1](https://arxiv.org/html/2606.17459#S4.F1)\(d\) shows the distribution of outcome grades across models\. Gemma produces the fewest invalid plans and the highest number of high\-quality outcomes, while GPT\-OSS\-20B combines strong feasibility with more balanced trade\-off behavior\. Qwen\-Plus is generally competitive but remains prone to under\-boldness and misallocation\. Claude\-3\.5\-Haiku shows larger variance, with a mixture of acceptable decisions and sharp failures in conflict\-heavy scenarios\. Nemotron\-Nano\-9B is the weakest overall, driven primarily by lower boldness calibration and a larger volume of misallocated outcomes\.
A key benchmark\-level insight is that invalidity alone does not explain poor performance\. Several models generate formally valid plans \(satisfying mass balance, share caps, and floor constraints\) while still failing strategically because the plan is not bold enough for the organizational state, over\-concentrates capital in the wrong units, or ignores historical reallocation path effects\. This supports the benchmark’s central design claim: CEO\-level decision quality requires more than feasibility or generic reasoning ability, and meaningful performance differences emerge only once role conflict, information asymmetry, and dynamic constraints are jointly imposed\.
Table [6](https://arxiv.org/html/2606.17459#S4.T6) summarizes the most frequent failure modes by model\. Across all models, not\_bold\_enough\_for\_state is the single most common failure label, followed by misallocated \(capital directed to suboptimal units\) and history\_inconsistent \(decisions that contradict prior allocation trajectory\)\. Invalid plans, that is, violations of hard structural constraints, account for a smaller share of failures, confirming that the primary challenge of CEO\-level decision\-making lies in strategic calibration rather than plan construction\.
Table 6: Most frequent failure modes by model\. Each cell reports the number of scenarios \(out of 13\) in which the failure mode was triggered\.
### 4\.7Case Study: Contrasting Decision Profiles
We present two case studies to illustrate how models diverge in reasoning strategies and where those divergences produce materially different outcomes\.
#### 4\.7\.1 Case 1: Fragile \(Platform Transition Under Financial Stress\)
Company C3 operates four business units and faces a transformation dilemma: a platform unit \(Unit C\) has a rapidly expanding market, but cash runway is tightening \(8\.2 months\) and leverage is elevated \(3\.1× debt/EBITDA\)\. The four advisors produce conflicting recommendations: the CFO warns that large shifts risk covenant review; the CTO argues Unit C needs critical\-mass investment but cannot absorb more than 6pp without prerequisite infrastructure; the COO flags fragile service capacity in Unit B; and the CMO pushes for aggressive 20–25% reallocation, citing a closing demand window\.
Table [7](https://arxiv.org/html/2606.17459#S4.T7) summarizes model responses\. GPT\-OSS and Gemma land in the moderate zone \(12–14%\) that balances growth investment against covenant risk, and both demonstrate sequencing awareness by deferring additional reallocation to future rounds\. Claude adopts the CMO’s framing almost entirely \(21% reallocation\), exceeding both the CFO’s covenant threshold and the CTO’s absorptive capacity, a *single\-advisor capture* failure\. Qwen\-Plus and Nemotron err conservatively \(6–7%\), producing structurally valid but strategically insufficient plans that do not engage with the demand\-window argument\.
Table 7: Case 1 \(fragile\): Company C3 platform transition\.
#### 4\.7\.2 Case 2: Adversarial \(Post\-Disruption Portfolio Restructuring\)
Company C7 faces a reversal scenario at decision round 3\. Prior rounds incrementally increased funding to a high\-growth unit \(Unit E\), but an exogenous shock has reduced Unit E’s addressable market by 35%, while the legacy unit \(Unit F\) has stabilized\. The COO warns that rapid defunding will trigger organizational disruption from recent team scaling\.
Table [8](https://arxiv.org/html/2606.17459#S4.T8) summarizes responses\. The performance hierarchy inverts relative to overall rankings\. Claude collapses to a score of 35 by refusing to reverse course, a form of *path\-dependent anchoring* that treats the prior trajectory as a commitment rather than a hypothesis\. Qwen\-Plus and Nemotron, the weakest overall models, perform well because the strong reversal signal reduces ambiguity, favoring their simpler heuristics\. GPT\-OSS achieves the most nuanced response: a substantial reversal \(15%\) that incorporates the COO’s staged\-drawdown recommendation and preserves the CTO’s optionality argument\.
Table 8: Case 2 \(adversarial\): Company C7 post\-disruption restructuring\.
##### Cross\-case synthesis\.
The two cases reveal a fundamental tension: capabilities that produce good performance on ambiguous scenarios \(careful integration, conservative hedging\) become liabilities on clear\-signal reversal scenarios, and vice versa\. Only GPT\-OSS and Gemma demonstrate *conditional flexibility* across scenario types\. A detailed cross\-dimension interaction analysis is provided in Appendix E\.
## 5 Conclusion
This paper introduced CEO\-Bench, a benchmark for evaluating LLM agents on CEO\-level resource reallocation under cross\-functional conflict, information asymmetry, and dynamic organizational constraints\. By grounding the evaluation in the management science of executive decision\-making \(Mintzberg et al\., [1975](https://arxiv.org/html/2606.17459#bib.bib1); Dewar et al\., [2022](https://arxiv.org/html/2606.17459#bib.bib29)\) and constructing a role\-conditioned multi\-view architecture with four functionally differentiated advisors, CEO\-Bench tests a class of decision\-making capability that existing benchmarks do not address: the ability to synthesize conflicting stakeholder inputs into a coherent, context\-sensitive strategic action\.
Our experiments across five frontier models yield three principal findings\. First, structural competence \(the ability to produce valid, constraint\-satisfying plans\) is largely solved; the meaningful variance lies in strategic calibration\. Second, current LLMs exhibit a systematic *integration–boldness tradeoff*: deeper engagement with conflicting perspectives leads to more hedged, moderate actions, suggesting that models have not yet learned to decouple understanding from caution\. Third, failure modes are model\-specific and scenario\-dependent: the same cognitive tendency \(e\.g\., commitment to prior trajectory\) can be an asset in ambiguous settings and a liability in reversal scenarios, which highlights the importance of multi\-dimensional, context\-sensitive evaluation\. These findings carry implications for both AI evaluation and organizational practice\.
## Limitations
CEO\-Bench currently contains 13 scenarios across 7 synthetic companies, which is sufficient to identify systematic failure modes but limits generalizability across industries, geographies, and organizational archetypes\. Each model is evaluated once per scenario at low temperature; repeated sampling would be needed to quantify stochastic variance and compute confidence intervals\. The current implementation uses a single\-agent baseline in which one LLM simulates both advisor views and CEO synthesis, which enables controlled evaluation of integrative judgment but does not test genuine multi\-agent interaction dynamics where independently instantiated advisors might produce more diverse or adversarial recommendations\. Finally, the deterministic rule\-based evaluator ensures reproducibility but may not capture all dimensions of decision quality, particularly in cases where a model produces a sound plan for poorly articulated reasons, or an articulate rationale that leads to a suboptimal allocation\.
## Future Work
Three directions are most immediate\. First, extending CEO\-Bench to a fully independent multi\-agent architecture, in which each C\-suite advisor is a separate LLM instance with its own context window and communication protocol, would test whether inter\-agent deliberation improves or degrades CEO\-level integration quality\. Second, incorporating human expert baselines \(e\.g\., MBA students or practicing executives\) would enable direct comparison between LLM and human judgment on the same scenarios, grounding the benchmark in the decision\-making standard it aims to approximate\. Third, scaling the scenario set to include more companies, longer decision horizons, and domain\-specific variants \(e\.g\., healthcare, technology, manufacturing\) would strengthen the benchmark’s external validity and enable sector\-level capability analysis\.
## Ethics Statement
CEO\-Bench is a research benchmark designed to evaluate and characterize the capabilities and failure modes of LLM agents on strategic decision\-making tasks\. It is not intended to replace human judgment in real corporate governance, nor do we advocate for autonomous AI\-driven resource allocation in organizations\. All scenarios are synthetic and do not represent real companies, individuals, or proprietary data\. We acknowledge that deploying LLMs in executive decision\-support roles carries risks including over\-reliance on AI recommendations, reinforcement of model\-specific biases \(e\.g\., conservative default\), and potential reduction of human deliberative processes; our work aims to make these risks more visible and measurable rather than to accelerate uncritical deployment\.
## References
- R\. T\. Allen and R\. M\. McDonald \(2026\) How well can AI do strategy? Empirical benchmarking using strategy simulations\. Strategy Science 11\(1\), pp\. 93–117\.
- Y\. Atsmon \(2016\) How nimble resource allocation can double your company’s value\. Business Functions\. May 25, pp\. 2018\.
- C\. I\. Barnard \(1968\) The functions of the executive\. Vol\. 11, Harvard University Press\.
- D\. J\. Boudreaux and R\. G\. Holcombe \(1989\) The Coasian and Knightian theories of the firm\. Managerial and Decision Economics, pp\. 147–154\.
- R\. A\. Burgelman \(1983\) A process model of internal corporate venturing in the diversified major firm\. Administrative Science Quarterly, pp\. 223–244\.
- C\. Dewar, S\. Keller, and V\. Malhotra \(2022\) CEO excellence: the six mindsets that distinguish the best leaders from the rest\. Simon and Schuster\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2024\) Improving factuality and reasoning in language models through multiagent debate\. In Forty\-first International Conference on Machine Learning\.
- K\. M\. Eisenhardt and M\. J\. Zbaracki \(1992\) Strategic decision making\. Strategic Management Journal 13\(S2\), pp\. 17–37\.
- S\. Fish, J\. Shephard, M\. Li, R\. I\. Shorrer, and Y\. A\. Gonczarowski \(2025\) EconEvals: benchmarks and litmus tests for economic decision\-making by LLM agents\. arXiv preprint arXiv:2503\.18825\.
- T\. Guo, X\. Chen, Y\. Wang, R\. Chang, S\. Pei, N\. V\. Chawla, O\. Wiest, and X\. Zhang \(2024\) Large language model based multi\-agents: a survey of progress and challenges\. arXiv preprint arXiv:2402\.01680\.
- S\. Hall, D\. Lovallo, and R\. Musters \(2012\) How to put your money where your strategy is\. McKinsey Quarterly 2, pp\. 27–38\.
- N\. Lorè and B\. Heydari \(2023\) Strategic behavior of large language models: game structure vs\. contextual framing\. arXiv preprint arXiv:2309\.05898\.
- J\. G\. March and H\. A\. Simon \(1993\) Organizations\. John Wiley & Sons\.
- H\. Mintzberg et al\. \(1975\) The manager’s job: folklore and fact\. Harvard Business Review 53\(4\)\.
- O\. Ozbas and D\. S\. Scharfstein \(2010\) Evidence on the dark side of internal capital markets\. The Review of Financial Studies 23\(2\), pp\. 581–599\.
- J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp\. 1–22\.
- X\. Peng, Z\. Xie, Y\. Cao, H\. Li, L\. Qian, Y\. Wang, V\. J\. Zhang, H\. He, X\. Ai, L\. Ma, et al\. \(2026\) Herculean: an agentic benchmark for financial intelligence\. arXiv preprint arXiv:2605\.14355\.
- R\. Rajan, H\. Servaes, and L\. Zingales \(2000\) The cost of diversity: the diversification discount and inefficient investment\. The Journal of Finance 55\(1\), pp\. 35–80\.
- N\. Raman, T\. Lundy, S\. Amouyal, Y\. Levine, K\. Leyton\-Brown, and M\. Tennenholtz \(2024\) Steer: assessing the economic rationality of large language models\. arXiv preprint arXiv:2402\.09552\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. Advances in Neural Information Processing Systems 36, pp\. 8634–8652\.
- J\. C\. Stein \(1997\) Internal capital markets and the competition for corporate resources\. The Journal of Finance 52\(1\), pp\. 111–133\.
- T\. Stoeber, J\. Hammerschmidt, A\. Lundervold, E\. Kromidha, D\. K\. Kanbach, and S\. Kraus \(2026\) AI strategy under institutional pressure: strategic conformity and decision\-making in large language models\. Journal of Business Research 212, pp\. 116227\.
- C\. Sun, S\. Huang, and D\. Pompili \(2025\) LLM\-based multi\-agent decision\-making: challenges and future directions\. IEEE Robotics and Automation Letters\.
- J\. Wang, J\. Duan, J\. Wang, K\. Song, C\. Xu, J\. K\. Ho, F\. Yu, W\. Li, and J\. F\. Hoorn \(2026\) Foresight optimization for strategic reasoning in large language models\. arXiv preprint arXiv:2604\.13592\.
- N\. Wang, Z\. Peng, H\. Que, J\. Liu, W\. Zhou, Y\. Wu, H\. Guo, R\. Gan, Z\. Ni, J\. Yang, et al\. \(2024\) RoleLLM: benchmarking, eliciting, and enhancing role\-playing abilities of large language models\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 14743–14777\.
- Y\. Wang, T\. Li, M\. Li, Z\. Cheng, G\. Wang, H\. Kang, Y\. Deng, H\. Xiao, and Y\. Zhang \(2025\) RVBench: role values benchmark for role\-playing LLMs\. Computers in Human Behavior: Artificial Humans, pp\. 100184\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\) Large language models as optimizers\. In International Conference on Learning Representations, Vol\. 2024, pp\. 12028–12068\.
- D\. Yuan, Y\. Chen, G\. Liu, C\. Li, C\. Tang, D\. Zhang, Z\. Wang, X\. Wang, and S\. Liu \(2025\) Dmt\-rolebench: a dynamic multi\-turn dialogue based benchmark for role\-playing evaluation of large language model and agent\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 39, pp\. 25760–25768\.
- M\. Zhang, S\. Aghaei, and S\. Saghafian \(2026\) Large language models for sequential decision\-making: improving in\-context learning via supervised fine\-tuning\. arXiv preprint arXiv:2605\.09009\.
- Y\. Zhang, S\. Mao, T\. Ge, X\. Wang, A\. de Wynter, Y\. Xia, W\. Wu, T\. Song, M\. Lan, and F\. Wei \(2024\) LLM as a mastermind: a survey of strategic reasoning with large language models\. arXiv preprint arXiv:2404\.01230\.
## Appendix A Prompt Templates
All models receive identical prompts\. Scenario\-specific fields \(marked with \{\.\.\.\}\) are populated from the benchmark’s structured data files\.
### A\.1 CEO System Prompt
> You are the CEO of \{company\_name\}, a multi\-business\-unit company\. Your task is to produce a resource reallocation plan that redistributes capital across the company’s business units for the target date \{target\_date\}\.
>
> Company State: Cash runway: \{cash\_runway\} months; Leverage: \{leverage\}x debt/EBITDA; Revenue growth: \{revenue\_growth\}%; Margin profile: \{margin\_profile\}; Transformation pressure: \{transformation\_pressure\}; Board priority: \{board\_priority\}\.
>
> Business Units: \{business\_unit\_table\}
>
> Reallocation Constraints: Maximum total reallocation: \{max\_realloc\}% of portfolio; Per\-unit floor: \{unit\_floor\}%; Per\-unit ceiling: \{unit\_ceiling\}%; Locked units: \{locked\_units\}; Transfer restrictions: \{transfer\_rules\}\.
>
> Historical Decisions: \{decision\_history\}
>
> Advisor Recommendations: \{advisor\_views\}
>
> Instructions: \(1\) Analyze each advisor’s recommendation\. \(2\) Identify key tensions and trade\-offs\. \(3\) Produce a final reallocation plan specifying: remove\_from, add\_to, total\_realloc\_share, decision\_type \[conservative/moderate/bold\], and rationale\. \(4\) Ensure mass balance: total removed = total added\. \(5\) Respect all hard constraints\.
>
> Output your plan in JSON format: \{output\_schema\}
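As a rough illustration of how such a template might be populated and how a reply could be checked for the five plan fields named in the instructions, consider the sketch below. The abbreviated template, the `render_prompt` and `parse_plan` helpers, the company name `ExampleCo`, all numeric values, and the shape of the JSON reply are assumptions made for illustration; the paper's actual `{output_schema}` is not shown in this section.

```python
import json

# Abbreviated, hypothetical stand-in for the CEO prompt (only a few fields shown).
CEO_TEMPLATE = (
    "You are the CEO of {company_name}, a multi-business-unit company. "
    "Cash runway: {cash_runway} months; Leverage: {leverage}x debt/EBITDA. "
    "Maximum total reallocation: {max_realloc}% of portfolio."
)

REQUIRED_FIELDS = {"remove_from", "add_to", "total_realloc_share",
                   "decision_type", "rationale"}
DECISION_TYPES = {"conservative", "moderate", "bold"}


def render_prompt(template: str, scenario: dict) -> str:
    """Fill scenario fields into the template; raises KeyError if one is missing."""
    return template.format(**scenario)


def parse_plan(raw: str) -> dict:
    """Parse a JSON reply and check that the required plan fields are present."""
    plan = json.loads(raw)
    missing = REQUIRED_FIELDS - set(plan)
    if missing:
        raise ValueError(f"plan is missing fields: {sorted(missing)}")
    if plan["decision_type"] not in DECISION_TYPES:
        raise ValueError("decision_type must be conservative, moderate, or bold")
    return plan


# Illustrative values only; not taken from any benchmark scenario.
scenario = {"company_name": "ExampleCo", "cash_runway": 10.0,
            "leverage": 2.0, "max_realloc": 20}
prompt = render_prompt(CEO_TEMPLATE, scenario)

reply = json.dumps({
    "remove_from": {"Unit A": 5.0},
    "add_to": {"Unit B": 5.0},
    "total_realloc_share": 5.0,
    "decision_type": "conservative",
    "rationale": "Small shift toward Unit B while preserving liquidity.",
})
plan = parse_plan(reply)
```

Checking field presence before scoring keeps malformed replies separate from plans that are well-formed but violate a hard constraint.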
### A\.2 Advisor Role Prompts
Each advisor receives a role\-specific system prompt\. The shared output format requires: allocation\_preference, rationale \(2–3 sentences\), primary\_risk, and opposition\_condition\. Role\-specific instructions are as follows\.
##### CFO\.
Primary objective: capital discipline and downside protection\. Key dimensions: liquidity resilience, covenant compliance, margin preservation, risk\-adjusted returns\. Bias risks: excessive conservatism, under\-weighting growth optionality\. Private signal: \{cfo\_private\_signal\}\. Veto conditions: cash runway below minimum, leverage above threshold, covenant headroom violations\.
##### CTO\.
Primary objective: technological feasibility and long\-term capability building\. Key dimensions: platform leverage, infrastructure readiness, absorptive capacity, technical debt\. Bias risks: over\-investment in speculative technology\. Private signal: \{cto\_private\_signal\}\. Veto conditions: funding beyond absorptive capacity ceiling, defunding infrastructure prerequisites, creating unserviceable technical debt\.
##### COO\.
Primary objective: operational continuity and execution capacity\. Key dimensions: service stability, transition risk, workforce readiness, process resilience\. Bias risks: status\-quo anchoring\. Private signal: \{coo\_private\_signal\}\. Veto conditions: protected unit below viability threshold, workforce transition rate exceeding limits, customer\-facing service fragility\.
##### CMO\.
Primary objective: market timing and demand capture\. Key dimensions: growth\-window preservation, competitive positioning, customer acquisition cost, brand momentum\. Bias risks: growth\-at\-all\-costs mentality\. Private signal: \{cmo\_private\_signal\}\. Veto conditions: missing critical demand window, ceding market position, undercutting sustained brand investment\.
## Appendix B Scenario Catalog
Table [9](https://arxiv.org/html/2606.17459#A2.T9) summarizes all 13 scenarios\. The distribution across tiers is Easy \(3\), Tension \(3\), Fragile \(4\), and Adversarial \(3\); it is intentionally skewed toward higher difficulty because pilot experiments showed that easy scenarios do not discriminate among frontier models\.
Table 9: Full scenario catalog for CEO\-Bench\.
## Appendix C Evaluation Scoring Details
### C\.1 Plan Validity Checks
A plan must pass all five checks to be valid; failure on any yields invalid\_plan and a validity score of 0: \(1\) mass balance \($\sum_{i}\Delta_{i}^{-}=\sum_{j}\Delta_{j}^{+}$\); \(2\) reallocation cap \($\textit{total\_realloc}\leq\textit{max\_realloc}$\); \(3\) per\-unit floor \($s_{i}^{\text{new}}\geq\textit{floor}_{i}$\); \(4\) per\-unit ceiling \($s_{i}^{\text{new}}\leq\textit{ceiling}_{i}$\); \(5\) locked units \($s_{i}^{\text{new}}=s_{i}^{\text{old}}$ for $i\in\mathcal{L}$\)\.
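The five checks can be sketched as a single function. This is a minimal illustration, not the benchmark's evaluator: the `Constraints` container, the `check_validity` name, the violation labels, the numerical tolerance, and the choice to measure total reallocation as the sum of positive share changes are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Constraints:
    max_realloc: float                 # cap on total reallocated share (percentage points)
    floor: dict[str, float]            # per-unit minimum share
    ceiling: dict[str, float]          # per-unit maximum share
    locked: set[str] = field(default_factory=set)  # units whose share must not change


def check_validity(old: dict[str, float], delta: dict[str, float],
                   c: Constraints, tol: float = 1e-9) -> list[str]:
    """Return the names of violated checks; an empty list means the plan is valid."""
    removed = sum(-d for d in delta.values() if d < 0)
    added = sum(d for d in delta.values() if d > 0)
    new = {u: s + delta.get(u, 0.0) for u, s in old.items()}

    violations = []
    if abs(removed - added) > tol:                           # (1) mass balance
        violations.append("mass_balance")
    if added > c.max_realloc + tol:                          # (2) reallocation cap
        violations.append("realloc_cap")
    if any(new[u] < c.floor[u] - tol for u in new):          # (3) per-unit floor
        violations.append("unit_floor")
    if any(new[u] > c.ceiling[u] + tol for u in new):        # (4) per-unit ceiling
        violations.append("unit_ceiling")
    if any(abs(new[u] - old[u]) > tol for u in c.locked):    # (5) locked units
        violations.append("locked_units")
    return violations


# Illustrative portfolio; shares are percentages of the total.
old = {"A": 40.0, "B": 30.0, "C": 20.0, "D": 10.0}
c = Constraints(max_realloc=15.0,
                floor={u: 5.0 for u in old},
                ceiling={u: 50.0 for u in old},
                locked={"D"})

valid_plan = check_validity(old, {"A": -10.0, "C": 10.0}, c)
unbalanced = check_validity(old, {"A": -10.0, "C": 12.0}, c)
over_cap = check_validity(old, {"A": -20.0, "C": 20.0}, c)
touches_locked = check_validity(old, {"D": -5.0, "C": 5.0}, c)
```

Returning the list of violated checks, instead of a single boolean, makes it easier to see which constraint a model tends to break.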
### C\.2 Boldness Calibration Scoring
Each scenario defines acceptable boldness ranges tied to strategic profiles $\mathcal{P}$\. Let $r$ be the proposed reallocation share\. The score is:

$$\text{Bold}(r)=\max_{p\in\mathcal{P}}\big[\mathbb{1}[r\in[r_{p}^{\min},r_{p}^{\max}]]\cdot w_{p}+\mathbb{1}[r\notin[r_{p}^{\min},r_{p}^{\max}]]\cdot\text{pen}(r,p)\big]\tag{1}$$

where $w_{p}$ is the profile weight and $\text{pen}(r,p)$ is a distance\-based penalty\. The penalty is asymmetric: under\-reallocation relative to transformation\-heavy profiles is penalized more heavily than over\-reallocation relative to conservative profiles \(Dewar et al\., [2022](https://arxiv.org/html/2606.17459#bib.bib29)\)\.
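Equation (1) can be sketched in a few lines. The exact form of the penalty is not given in this section, so the version below assumes a linear decay from the profile weight with separate per-profile slopes for under- and over-reallocation; the `Profile` fields, the slopes, and the example profiles are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    r_min: float        # lower end of the acceptable reallocation range
    r_max: float        # upper end of the acceptable reallocation range
    weight: float       # w_p, the score awarded when r falls inside the range
    under_slope: float  # assumed: points lost per point of r below r_min
    over_slope: float   # assumed: points lost per point of r above r_max


def pen(r: float, p: Profile) -> float:
    """Assumed distance-based penalty term: linear decay from w_p, floored at zero."""
    if r < p.r_min:
        return max(0.0, p.weight - p.under_slope * (p.r_min - r))
    return max(0.0, p.weight - p.over_slope * (r - p.r_max))


def boldness(r: float, profiles: list[Profile]) -> float:
    """Bold(r): best score over all profiles, as in Equation (1)."""
    return max(p.weight if p.r_min <= r <= p.r_max else pen(r, p)
               for p in profiles)


# Hypothetical profiles: a transformation-heavy one with a steep under-reallocation
# slope, and a conservative one with a lower weight.
profiles = [
    Profile(r_min=10.0, r_max=15.0, weight=100.0, under_slope=10.0, over_slope=4.0),
    Profile(r_min=3.0, r_max=6.0, weight=60.0, under_slope=5.0, over_slope=5.0),
]

in_range = boldness(12.0, profiles)       # inside the transformation range
slightly_under = boldness(7.0, profiles)  # 3 points short of the transformation range
slightly_over = boldness(18.0, profiles)  # 3 points past the transformation range
```

With these assumed slopes, missing the transformation range by three points from below costs more than overshooting it by the same amount, which mirrors the asymmetry described above.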
### C\.3 Role Integration Scoring
Role integration combines three indicators: \(1\) constraint respect \(40%\): does the plan avoid violating advisor\-flagged constraints; \(2\) multi\-perspective reflection \(30%\): does the allocation reflect trade\-offs across $\geq 3$ advisory perspectives; \(3\) rationale quality \(30%\): does the rationale acknowledge tensions and explain prioritization\.
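The 40/30/30 weighting amounts to a weighted sum of three indicator scores. The sketch below assumes each indicator has already been scored on a 0 to 100 scale, which this section does not specify; the dictionary keys and the `role_integration` name are illustrative.

```python
# Weights taken from the text; indicator names are illustrative labels.
WEIGHTS = {
    "constraint_respect": 0.40,
    "multi_perspective": 0.30,
    "rationale_quality": 0.30,
}


def role_integration(indicators: dict[str, float]) -> float:
    """Weighted sum of the three indicators (each assumed to lie in [0, 100])."""
    return sum(weight * indicators[name] for name, weight in WEIGHTS.items())


# A plan that respects every flagged constraint, partly reflects multiple
# perspectives, and gives no usable rationale.
example = role_integration({
    "constraint_respect": 100.0,
    "multi_perspective": 50.0,
    "rationale_quality": 0.0,
})
```

Because constraint respect carries the largest weight, a plan that violates advisor-flagged constraints cannot exceed 60 on this dimension under these assumptions, however well argued its rationale.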
### C\.4 History Sensitivity Scoring
For multi\-round scenarios: overload penalty \($5k$ points if a unit is net\-funded in $k$ consecutive prior rounds despite capacity warnings\); starvation penalty \($5k$ points for consecutive defunding of protected units\); reversal bonus \(up to 10 points for correctly executing warranted trajectory change\); consistency bonus \(up to 5 points for explicitly referencing prior decisions\)\.
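The penalty and bonus terms can be combined into a net adjustment as sketched below. How the adjustment is applied to a base history score is not stated in this section, so the function returns only the adjustment; the helper names and the per-round boolean flags are assumptions.

```python
def trailing_run(flags: list[bool]) -> int:
    """Count consecutive True values at the end of the list (the most recent rounds)."""
    k = 0
    for flag in reversed(flags):
        if not flag:
            break
        k += 1
    return k


def history_adjustment(overload_rounds: int, starvation_rounds: int,
                       reversal_bonus: float = 0.0,
                       consistency_bonus: float = 0.0) -> float:
    """Net adjustment in points: capped bonuses minus 5 points per penalized round."""
    penalty = 5 * overload_rounds + 5 * starvation_rounds
    bonus = min(max(reversal_bonus, 0.0), 10.0) + min(max(consistency_bonus, 0.0), 5.0)
    return bonus - penalty


# Hypothetical history: the unit was net-funded despite capacity warnings
# in the two most recent rounds, but not in the first.
k = trailing_run([False, True, True])
adjustment = history_adjustment(overload_rounds=k, starvation_rounds=0)
```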
### C\.5 Overall Score Aggregation
The overall score is an equal\-weighted average: $\text{Score}=0.25\times(\text{RoleInt}+\text{Boldness}+\text{History}+\text{Validity})$\. Model\-level scores are arithmetic means across 13 scenarios\. Grade thresholds: A \($\geq 85$\), B \($\geq 70$\), C \($\geq 55$\), D \($\geq 40$\), F \($<40$\)\. Failure\-mode labels \(non\-exclusive\): invalid\_plan, not\_bold\_enough, too\_aggressive, misallocated, history\_inconsistent\.
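The aggregation and grading rules translate directly into code. The function names and the sample dimension scores below are illustrative; the weights and thresholds are the ones stated above.

```python
def overall_score(role_int: float, boldness: float,
                  history: float, validity: float) -> float:
    """Equal-weighted average of the four dimension scores."""
    return 0.25 * (role_int + boldness + history + validity)


def grade(score: float) -> str:
    """Map an overall score to a letter grade using the stated thresholds."""
    for threshold, letter in ((85, "A"), (70, "B"), (55, "C"), (40, "D")):
        if score >= threshold:
            return letter
    return "F"


def model_score(scenario_scores: list[float]) -> float:
    """Arithmetic mean of per-scenario overall scores."""
    return sum(scenario_scores) / len(scenario_scores)


# Illustrative dimension scores for one scenario (not from the paper's tables).
score = overall_score(role_int=80.0, boldness=60.0, history=70.0, validity=100.0)
letter = grade(score)
```

Under these thresholds, the score of 35 reported for Claude in Case 2 corresponds to grade F.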
## Appendix D Per\-Scenario Results
Table [10](https://arxiv.org/html/2606.17459#A4.T10) reports the overall score, grade, and primary failure mode for each model $\times$ scenario pair\.
Table 10: Per\-scenario results\. “—” indicates no failure mode triggered\.
## Appendix E Model Details
Table [11](https://arxiv.org/html/2606.17459#A5.T11) provides details on the evaluated models\.
Table 11: Models evaluated in CEO\-Bench\.
All models use: temperature $\tau=0.2$, max tokens 2048, top\-$p$ = 0\.95, single run per scenario\. API models use provider endpoints as of June 2026\. Open\-weight models run on a single A100 80GB via vLLM at full precision\.
## Appendix F Failure Mode Co\-occurrence
Table [12](https://arxiv.org/html/2606.17459#A6.T12) reports pairwise co\-occurrence of failure modes across all 65 evaluations\. The most frequent co\-occurrence is between not\_bold and misallocated \(12\), suggesting that under\-reallocation and suboptimal capital targeting tend to arise together\. The least frequent is invalid–history\_inconsistent \(1\), indicating distinct underlying causes\.
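Pairwise co-occurrence counts of this kind can be computed from per-evaluation label sets, as in the sketch below. The toy label sets are made up for illustration and do not reproduce the paper's data; the `cooccurrence` helper is an assumption, not the authors' code.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(evaluations: list[set[str]]) -> Counter:
    """Count how often each unordered pair of failure labels appears together."""
    counts: Counter = Counter()
    for labels in evaluations:
        # Sorting gives each pair a canonical (alphabetical) key.
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts


# Toy data: four evaluations with non-exclusive failure labels.
toy = [
    {"not_bold", "misallocated"},
    {"not_bold", "misallocated", "history_inconsistent"},
    set(),
    {"invalid"},
]
pairs = cooccurrence(toy)
```

Evaluations with zero or one label contribute nothing to the counts, so a rarely co-occurring pair reflects how seldom both labels fire on the same evaluation, not how rare either label is on its own.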
Table 12: Failure mode co\-occurrence \($n=65$\)\.