Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Summary
This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game, identifying four inference-time levers and introducing the concept of agent bullwhip. It shows that a reasoning model can exceed human performance, and proposes GRPO-based post-training to improve reliability.
# Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
Source: [https://arxiv.org/html/2605.17036](https://arxiv.org/html/2605.17036)
Huangyuan Su \(Harvard University/Kempner Institute\), Andre P\. Calmon \(Georgia Tech\), Flavio P\. Calmon \(Harvard University\)
###### Abstract
This paper studies autonomous generative AI agents in multi\-echelon supply chains using the MIT Beer Game\. We identify four inference\-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering\. Model capability is the dominant factor: an out\-of\-the\-box reasoning model exceeds human\-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams\. However, strong average performance masks substantial reliability risks\. We introduce the *agent bullwhip*, the amplification of decision unreliability across echelons, manifesting along two dimensions: decision variance increases both across facilities at the same point in time and within the same facility across time\. We develop a mathematical framework showing that this phenomenon is inherent to multi\-agent systems that involve coordination and information delays, and we demonstrate that repeated sampling fails to meaningfully reduce it\. To address this limitation, we propose a Group Relative Policy Optimization \(GRPO\)\-based reinforcement\-learning post\-training framework that trains a shared base LLM using system\-level supply\-chain rewards\. GRPO post\-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply\-chain agents\.
## 1 Introduction
Footnote 1: This paper expands on, and provides more technical detail about, the concepts and framework described in a recent article: Long, C\., Simchi\-Levi, D\., Calmon, A\. P\., & Calmon, F\. P\. When supply chains become autonomous\. Harvard Business Review \(Long et al\., [2025a](https://arxiv.org/html/2605.17036#bib.bib37)\)\.
Experts suggest that a fully autonomous supply chain, where AI makes all inventory and logistics decisions, may be close at hand\. They predict that autonomous supply chains will soon deliver significant gains in productivity, efficiency, and responsiveness\. Fueling this excitement is the rapid progress in Large Language Models \(LLMs\), the engine behind Generative AI \(GenAI\), which can now handle tasks ranging from procurement decisions and revenue management to logistics\. So far, however, most LLM applications in supply chains have focused on narrow tasks, using a single model to enhance a specific function, such as demand forecasting or replenishment decisions \(Menache et al\., [2025](https://arxiv.org/html/2605.17036#bib.bib38)\)\. The broader vision is more ambitious: a fully autonomous supply chain in which multiple LLM agents collaborate, each executing distinct responsibilities across the network\. But how close are we to this reality? To find out, we reimagined the Beer Game, a classic supply chain management simulation developed in the 1960s and used by countless management education programs, by replacing every human player with a GenAI agent\. This setup allows us to analyze the supply\-chain management capabilities of GenAI agents and the impact of lead time, information sharing, and financial constraints on agent performance\. Our self\-contained supply\-chain simulation serves as a testbed for benchmarking GenAI agents against human experts, anticipating integration challenges, and identifying strategies for using this technology effectively in supply\-chain management\.
We show that making an autonomous, GenAI\-managed supply chain work depends on mastering four critical levers: the model you choose, the policies and guardrails you set, the information you share, and the instructions you give\. We benchmarked AI performance against that of humans: we used data from 12 Georgia Tech cohorts with more than 100 students in total who played the Beer Game over the past three years, all operating under the same system conditions as the GenAI testbed\. In our best\-performing setup—using Llama 4 Maverick 17B with optimized prompts, data\-sharing rules, and guardrails—the AI agents reduced average costs across 30 replications of the game by as much as 67% relative to the student teams\.
Of course, average performance alone is insufficient to justify operational deployment in real\-world applications\. Supply chain practitioners must evaluate not only expected cost outcomes but also system reliability\. An autonomous policy that yields low average costs but occasionally produces highly volatile procurement, production, or ordering decisions is practically unviable\. This is especially true in multi\-echelon networks, where localized errors can rapidly propagate, compounding single deviations into distorted demand signals for upstream agents over time\.
The reliability issue of autonomous agents is especially salient because the underlying LLMs are stochastic, which can produce inconsistent decisions over repeated runs of the same prompt\. Our analysis reveals that while out\-of\-the\-box GenAI agents can be efficient on average, i\.e\., average cost is low relative to that of human decision makers, they remain unreliable: occasional ordering decisions can lead to high total supply chain costs\. We show that training agents with synthetic data dramatically increases reliability while maintaining a significant cost advantage over human decision makers\.
To capture the unreliability risk in using LLM agents for decision\-making, we introduce the concept of the *agent bullwhip* effect: the amplification of decision instability and unreliability across runs in multi\-agent systems\. This instability manifests along two dimensions\. First, at a fixed point in time, decision variance increases across facilities as one moves upstream: retailers tend to produce relatively stable orders, while wholesalers, distributors, and factories exhibit progressively larger dispersion and more severe tail decisions\. Second, within the same facility, decision variance can grow over time: early ordering differences alter inventory positions, backlogs, and shipment pipelines, so small behavioral deviations compound through delayed feedback and the risk of erratic ordering increases as the game progresses\. Our theoretical analysis shows that this phenomenon is not an incidental artifact of a particular model or decoding procedure, but an inherent risk in multi\-agent systems where autonomous agents coordinate through delayed and partial information\. In such settings, collaboration and lead times create feedback channels that propagate and amplify decision unreliability\. This motivates moving beyond inference\-time fixes, such as repeated sampling, toward reinforcement\-learning post\-training\. In particular, we propose a GRPO\-based framework that trains a shared base model using system\-level supply\-chain rewards, enabling agents to internalize coordinated replenishment policies that reduce both cross\-facility and intertemporal decision variance\.
These findings point toward a future in which AI handles routine operational decisions while offering cost efficiency and flexible availability, thereby creating capacity for human experts to pursue higher\-level strategic challenges in supply chains\.
The main contributions of this paper are as follows:
1. We use the GenAI Beer Game as a controlled testbed for autonomous supply\-chain decision\-making and benchmark LLM agents against human teams\. The results identify four inference\-time levers that shape effectiveness: model selection, guardrails, centralized data sharing, and prompt engineering\. Model capability is the dominant lever: reasoning models exceed human\-level performance out of the box, while non\-reasoning models require additional constraints, orchestration, and prompting to close the gap\. In our best\-performing AI setup, GenAI agents reduce costs by up to 67% relative to human teams\.
2. We show that strong average performance can mask substantial reliability risk\. We introduce *agent bullwhip*: the amplification of run\-to\-run decision instability in multi\-echelon autonomous systems\. Empirically, this instability appears along two dimensions: decision variance increases across facilities as one moves upstream, and decision variance within the same facility can grow over time\. Operationally, upstream agents are exposed not only to larger orders, but also to more volatile decisions and more severe tail outcomes\.
3. We evaluate repeated sampling as a training\-free mitigation strategy and find that it is ineffective\. Although majority voting over multiple samples is commonly used to reduce model stochasticity, it does not meaningfully reduce agent bullwhip in our setting\. This indicates that the instability is not merely incidental decoding noise; it reflects policy\-level unreliability that can continue to propagate through the supply\-chain network\.
4. We develop a mathematical framework that separates demand\-driven and decision\-driven order variability\. Using a transfer\-function analysis, we show that external demand shocks and agent\-level decision shocks are transmitted through the same delayed replenishment feedback loop\. The framework explains why reliability failures can arise even when average cost performance appears strong, and why decision instability is a structural risk in multi\-agent systems with information delays and decentralized coordination\.
5. We propose a GRPO\-based post\-training framework for adapting LLM agents to the supply\-chain task\. The framework trains a shared base LLM using system\-level supply\-chain rewards, enabling agents to learn coordinated policies during training while still being deployed as independent decision\-makers with limited local visibility at test time\. GRPO post\-training substantially curtails agent bullwhip, reduces tail events, and improves both the reliability and efficiency of autonomous supply\-chain agents\.
## 2 Setup and Related Work
### 2\.1 The Beer Game: A Timeless Lesson in Supply Chain Dynamics
The Beer Distribution Game is a canonical system\-dynamics environment for studying feedback, delay, and boundedly rational decision\-making in supply chains \(Forrester, [1961](https://arxiv.org/html/2605.17036#bib.bib15); Sterman, [1989](https://arxiv.org/html/2605.17036#bib.bib16)\)\. Its enduring lesson is that local decisions can generate system\-level instability even when each participant is trying to behave rationally\. The Beer Game is therefore a useful testbed for autonomous AI agents because it turns a simple decision interface into a dynamic coordination problem with delayed feedback\.
In the Beer Game, four players operate a simple, serial supply chain with a retailer, a wholesaler, a distributor, and a factory\. Each week, every player makes one decision: how much to order from their upstream partner\. The setup is straightforward, but the constraints are revealing\. Players must balance the cost of holding excess inventory against the penalty for backorders—unfulfilled orders that must be shipped later\. The structure of the beer supply chain complicates this central trade\-off\. Players operate in silos and cannot communicate, and only the retailer sees the actual end\-customer demand\. Significant built\-in delays exist for both orders and shipments, and it typically takes two weeks for a shipment to arrive\. This creates "pipeline inventory"—beer that has been ordered but is not yet on hand—which many players fail to account for\. The players’ shared goal is to meet demand at the lowest possible total cost for the entire supply chain\.
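The weekly mechanics described above can be sketched for a single facility as follows. This is a minimal illustration, not the simulator used in the paper: the cost rates, the starting inventory, and the assumption that the upstream supplier always ships the full order are all placeholders chosen for the example. Only the two\-week shipment delay comes from the text.

```python
from collections import deque
from dataclasses import dataclass, field

HOLDING_COST = 0.5   # assumed cost per unit held per week
BACKLOG_COST = 1.0   # assumed cost per unit backordered per week
SHIPPING_DELAY = 2   # weeks for a shipment to arrive, as in the text


@dataclass
class Facility:
    inventory: int = 12   # assumed starting stock
    backlog: int = 0
    # Shipments in transit; the leftmost entry arrives next.
    pipeline: deque = field(default_factory=lambda: deque([0] * SHIPPING_DELAY))

    def step(self, incoming_order: int, order_placed: int) -> float:
        """Advance one week and return this week's cost."""
        # 1. Receive the shipment that completes its delay this week.
        self.inventory += self.pipeline.popleft()
        # 2. Serve the new order plus any existing backlog from on-hand stock.
        owed = incoming_order + self.backlog
        shipped = min(self.inventory, owed)
        self.inventory -= shipped
        self.backlog = owed - shipped
        # 3. Place an upstream order (assumed to be filled in full).
        self.pipeline.append(order_placed)
        return HOLDING_COST * self.inventory + BACKLOG_COST * self.backlog

    @property
    def inventory_position(self) -> int:
        """On-hand stock plus pipeline inventory, net of backlog."""
        return self.inventory + sum(self.pipeline) - self.backlog


retailer = Facility()
week_cost = retailer.step(incoming_order=8, order_placed=8)
print(week_cost, retailer.inventory_position)
```

The `inventory_position` property makes the pipeline inventory explicit: a player who looks only at `inventory` and `backlog` will tend to reorder stock that is already on its way.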
This structure produces the classical bullwhip effect: order signals become distorted and amplified as demand information moves upstream\. A small, temporary fluctuation in customer demand creates wild swings in upstream agents’ orders\. A longstanding literature studies the bullwhip effect as the upstream amplification of order variability caused by information distortion in supply chains\. Lee et al\. \(Lee et al\., [1997b](https://arxiv.org/html/2605.17036#bib.bib17), [a](https://arxiv.org/html/2605.17036#bib.bib18)\) introduce the phenomenon and identify key drivers, including demand signal processing, order batching, price fluctuations, and rationing behavior\. Chen et al\. \(Chen et al\., [2000a](https://arxiv.org/html/2605.17036#bib.bib41)\) quantify how forecasting rules, lead times, and information sharing affect the magnitude of bullwhip in a simple supply chain, and show that exponential smoothing forecasts can further amplify order variability depending on lead times and forecasting parameters \(Chen et al\., [2000b](https://arxiv.org/html/2605.17036#bib.bib1)\)\. Building on this literature, we study an agent\-driven bullwhip from the perspective of across\-run reliability: autonomous LLM agents can introduce decision variability even when demand paths and operational states are held fixed\.
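A common way to quantify the classical bullwhip effect is the ratio of the variance of orders placed to the variance of demand received; a ratio above 1 indicates amplification. The sketch below assumes that definition, and the demand and order series are made\-up numbers for illustration, not data from this paper.

```python
from statistics import pvariance


def bullwhip_ratio(orders, demand):
    """Variance of orders placed divided by variance of demand received."""
    demand_var = pvariance(demand)
    if demand_var == 0:
        raise ValueError("demand has zero variance; the ratio is undefined")
    return pvariance(orders) / demand_var


# Illustrative series: demand steps up once, the retailer overreacts briefly.
customer_demand = [4, 4, 8, 8, 8, 8]
retailer_orders = [4, 4, 12, 10, 8, 8]

ratio = bullwhip_ratio(retailer_orders, customer_demand)
print(f"bullwhip ratio: {ratio:.2f}")
```

Applying the same function to each adjacent pair of echelons gives one ratio per stage of the chain.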
### 2\.2 GenAI in Supply Chains and the Beer Game
Recent work on Generative AI \(GenAI\) in supply chain management studies language models for demand forecasting, procurement support, replenishment planning, and managerial decision support \(Menache et al\., [2025](https://arxiv.org/html/2605.17036#bib.bib38)\)\. A related stream examines LLM\-based or foundation\-model agents for inventory management and autonomous supply chains, emphasizing natural\-language interfaces, interpretable coordination, and flexible decision support across organizational boundaries \(Simchi\-Levi et al\., [2025a](https://arxiv.org/html/2605.17036#bib.bib12); Quan and Liu, [2024](https://arxiv.org/html/2605.17036#bib.bib24); Xu et al\., [2024a](https://arxiv.org/html/2605.17036#bib.bib25), [b](https://arxiv.org/html/2605.17036#bib.bib26); Zheng et al\., [2025](https://arxiv.org/html/2605.17036#bib.bib8)\)\. Much of this literature, however, focuses on specialized applications or architectures that assume substantial system design around the model\.
Our setting is closer to how firms are likely to deploy frontier models: through standard interfaces, with limited ability to retrain closed\-weight models\. We therefore ask whether off\-the\-shelf GenAI agents can manage a dynamic multi\-echelon supply chain when each agent controls one role in the Beer Game\.
The setup of this paper is based on a recent implementation of an AI\-powered version of the Beer Game \(Long et al\., [2025b](https://arxiv.org/html/2605.17036#bib.bib40), [a](https://arxiv.org/html/2605.17036#bib.bib37)\)\. The GenAI Beer Game is similar to the classic version, but replaces human players with LLM agents \(e\.g\., GPT\-5\)\. Each agent takes on a single role, such as the wholesaler, and makes ordering decisions autonomously\. Like human players, AI agents manage inventory, respond to downstream orders, and submit upstream orders\. In contrast to many AI benchmarks that test a single LLM’s performance on a task, the GenAI Beer Game examines the agents’ ability to coordinate as a group\.
In Section [3](https://arxiv.org/html/2605.17036#S3), we focus on "inference\-time methods" to improve the effectiveness of off\-the\-shelf LLMs\. We consider approaches that optimize how these models are used rather than changing the models themselves\. Inference\-time methods include crafting better instruction prompts for the agents, orchestrating information flow between agents, and designing simple rules or policies that limit what actions agents can take\. Unlike existing work \(Boussioux et al\., [2025](https://arxiv.org/html/2605.17036#bib.bib3)\) that studies prompting\-based reasoning interventions for improving LLM decisions in the Beer Game, we focus on fully autonomous agent deployment, evaluating both inference\-time methods and a reinforcement\-learning post\-training framework\.
### 2\.3 Scaling Test\-Time Compute
We consider scaling test\-time compute in the form of repeated sampling to enhance the reliability of the LLM agents and reduce tail risks\. Our inference\-time analysis connects to work on test\-time compute, where LLM performance is improved by sampling multiple candidate responses, applying self\-consistency or voting, or allocating more inference compute to harder instances \(Wang et al\., [2022](https://arxiv.org/html/2605.17036#bib.bib39); Wu et al\., [2024](https://arxiv.org/html/2605.17036#bib.bib31); Snell et al\., [2024](https://arxiv.org/html/2605.17036#bib.bib33); Brown et al\., [2024](https://arxiv.org/html/2605.17036#bib.bib32)\)\. These methods are attractive because they avoid the computational cost of post\-training LLMs\. A Beer Game decision, however, differs from a static response to a self\-contained prompt: it is a context\-dependent decision whose consequences propagate through inventories, backlogs, and shipment pipelines\. This distinction motivates our empirical test of repeated sampling in Section [4\.3](https://arxiv.org/html/2605.17036#S4.SS3)\. If instability primarily reflects decoding noise, aggregation should stabilize decisions; if it reflects a weakness in the underlying policy, improving reliability requires a stronger intervention\.
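Repeated sampling with majority voting can be sketched as below. The `sample_order` callable is a hypothetical stand\-in for one stochastic LLM call that maps a state to an integer order, and the tie\-breaking rule (prefer the smallest order) is an assumption made for this example rather than a detail taken from the paper.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most frequent order; ties go to the smallest order."""
    if not samples:
        raise ValueError("need at least one sampled order")
    counts = Counter(samples)
    top = max(counts.values())
    return min(order for order, n in counts.items() if n == top)


def decide_with_repeated_sampling(sample_order, state, k=5):
    """Query a stochastic policy k times on the same state and aggregate."""
    samples = [sample_order(state) for _ in range(k)]
    return majority_vote(samples)


# One outlier (40) among five samples is voted out.
print(majority_vote([8, 8, 40, 6, 8]))
```

Voting removes an isolated outlier, but if most samples share the same biased ordering rule, the aggregate inherits that bias. This is the distinction between decoding noise and policy weakness drawn above.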
### 2\.4 Multi\-agent Collaboration
Multi\-agent collaboration has long been studied in supply\-chain management\. Early agent\-based work treated supply chains as networks of autonomous but interdependent entities and focused on modeling, coordination, and decision support\. Swaminathan et al\. \(Swaminathan et al\., [1998](https://arxiv.org/html/2605.17036#bib.bib19)\) develop a reusable multi\-agent framework in which supply\-chain models are built from agent types, control elements, and interaction protocols\. Fox et al\. \(Fox et al\., [2000](https://arxiv.org/html/2605.17036#bib.bib20)\) propose an agent\-oriented architecture for tactical and operational supply\-chain management, emphasizing autonomous software agents that coordinate through communication protocols\. Nissen \(Nissen, [2001](https://arxiv.org/html/2605.17036#bib.bib21)\) studies intelligent agents for supply\-chain integration, focusing on agents that conduct business on behalf of buyers, vendors, and users\. Julka et al\. \(Julka et al\., [2002](https://arxiv.org/html/2605.17036#bib.bib22)\) apply an agent\-based decision\-support framework to refinery supply\-chain management, where coordination across departments and dynamic data sources is central\.
More recent work extends this tradition using learning\-based and foundation\-model approaches\. Multi\-agent reinforcement\-learning studies train decentralized or partially decentralized policies for inventory control and transshipment, often using centralized training to improve coordination while preserving decentralized execution \(Kotecha and del Rio Chanona, [2025](https://arxiv.org/html/2605.17036#bib.bib27); Kim et al\., [2024](https://arxiv.org/html/2605.17036#bib.bib23)\)\. A parallel stream explores LLM\-based agents for supply\-chain tasks: Quan and Liu \(Quan and Liu, [2024](https://arxiv.org/html/2605.17036#bib.bib24)\) introduce InvAgent, a zero\-shot LLM\-based multi\-agent system for inventory management; Jannelli et al\. \(Jannelli et al\., [2026](https://arxiv.org/html/2605.17036#bib.bib14)\) study autonomous LLM agents for consensus\-seeking in supply\-chain coordination; and Xu et al\. \(Xu et al\., [2024b](https://arxiv.org/html/2605.17036#bib.bib26)\) examine autonomous supply chains through a multi\-agent systems approach\. Our work builds on these streams but shifts the focus from modeling architectures, average performance, or task automation to reliability\. We study autonomous LLM agents whose decisions interact dynamically through inventories, backlogs, orders, and shipment pipelines, and show that these interactions can amplify decision instability across repeated runs even when demand paths and operational states are held fixed\.
### 2\.5 Reinforcement Learning Post\-training for Reliable Agents
The reliability problem also connects to reinforcement\-learning post\-training for LLMs\. Policy\-gradient methods such as Proximal Policy Optimization provide a general framework for improving stochastic policies through reward feedback \(Schulman et al\., [2017](https://arxiv.org/html/2605.17036#bib.bib34)\)\. Group Relative Policy Optimization \(GRPO\) replaces a learned value critic with relative comparisons across sampled outputs, making it attractive when reward evaluation is easier than value estimation \(Shao et al\., [2024](https://arxiv.org/html/2605.17036#bib.bib35)\)\. Recent reasoning models show that reinforcement learning can induce more reliable behavior when rewards are aligned with the task objective \(Guo et al\., [2025](https://arxiv.org/html/2605.17036#bib.bib36)\)\. In supply chains, rewards are operationally observable through holding cost, backlog cost, and total system cost\. Therefore, the Beer Game provides a useful post\-training environment in which local actions have delayed consequences and system\-level rewards measure coordination across echelons\.
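The "relative comparisons across sampled outputs" in GRPO are commonly implemented by standardizing each sampled output's reward against the mean and standard deviation of its group. The sketch below shows only that advantage computation, with the reward taken to be the negative total system cost. The function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions for illustration; this is not the training code from the paper.

```python
from statistics import mean, pstdev


def system_reward(holding_costs, backlog_costs):
    """System-level reward: negative total cost summed over all echelons."""
    return -(sum(holding_costs) + sum(backlog_costs))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled rollouts.

    No value critic is needed: each rollout is scored relative to its peers.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts from the same starting state, with made-up per-echelon costs.
rewards = [
    system_reward([10.0, 12.0, 15.0, 20.0], [0.0, 5.0, 8.0, 12.0]),
    system_reward([8.0, 9.0, 10.0, 11.0], [0.0, 0.0, 2.0, 3.0]),
    system_reward([20.0, 30.0, 45.0, 60.0], [4.0, 10.0, 25.0, 40.0]),
    system_reward([9.0, 10.0, 12.0, 14.0], [1.0, 2.0, 3.0, 4.0]),
]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Rollouts with below\-average system cost receive positive advantages and are reinforced; the high\-cost rollout receives a large negative advantage.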
### 2\.6 Sample\-Path Reliability and Tail Risk in Autonomous Supply Chains
There is growing interest in evaluating decision\-making policies beyond expected performance, including their distributional behavior, tail outcomes, and realized sample paths \(Simchi\-Levi et al\., [2025b](https://arxiv.org/html/2605.17036#bib.bib4), [2023](https://arxiv.org/html/2605.17036#bib.bib5); Zhu and Simchi\-Levi, [2026](https://arxiv.org/html/2605.17036#bib.bib6)\)\. This perspective is especially important in supply chains, where decisions are sequential, coupled across facilities, and exposed to demand uncertainty and lead\-time delays\. A policy with low expected cost may still generate unacceptable realized trajectories, such as inventory depletion, backlog accumulation, excessive order volatility, or persistent service failures\.
This concern has roots in robust and risk\-aware inventory theory\. Classical distribution\-free inventory models protect against poor realizations when the demand distribution is only partially known \(Scarf, [1958](https://arxiv.org/html/2605.17036#bib.bib42); Gallego and Moon, [1993](https://arxiv.org/html/2605.17036#bib.bib43)\), while robust optimization approaches extend this logic to dynamic inventory and supply\-chain systems \(Bertsimas and Thiele, [2006](https://arxiv.org/html/2605.17036#bib.bib44)\)\. A parallel literature in sequential decision\-making studies tail\-sensitive and constraint\-aware policies, including CVaR optimization \(Rockafellar and Uryasev, [2000](https://arxiv.org/html/2605.17036#bib.bib45); Chow et al\., [2015](https://arxiv.org/html/2605.17036#bib.bib46)\), safe reinforcement learning \(García and Fernández, [2015](https://arxiv.org/html/2605.17036#bib.bib47)\), and constrained policy optimization \(Achiam et al\., [2017](https://arxiv.org/html/2605.17036#bib.bib48)\)\. Our setting differs from this work in that reliability risk is generated not only by exogenous demand uncertainty or model misspecification, but also by stochastic LLM decisions that interact through multi\-echelon feedback\.
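To make the tail\-risk idea concrete, the sketch below computes a simple empirical conditional value\-at\-risk (CVaR) over total costs from repeated runs: the mean of the worst `1 - alpha` share of outcomes. This is a discrete approximation chosen for illustration, the cost values are invented, and the source does not state that this particular estimator is used in the paper.

```python
import math


def empirical_cvar(costs, alpha=0.9):
    """Mean of the worst (1 - alpha) share of costs; higher cost is worse."""
    if not costs:
        raise ValueError("need at least one cost")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    ordered = sorted(costs, reverse=True)
    # Round before ceil to avoid floating-point overshoot such as 3.0000000004.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:k]) / k


# Nine ordinary runs and one breakdown: the mean hides what the tail shows.
run_costs = [100.0] * 9 + [1000.0]
print(sum(run_costs) / len(run_costs), empirical_cvar(run_costs, alpha=0.9))
```

With `alpha=0.0` the function reduces to the plain average, which shows how a mean\-only evaluation is a special case that ignores the tail.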
This paper brings the sample\-path perspective to autonomous AI agents\. Standard evaluations of LLM agents often emphasize average task performance or aggregate cost, but supply\-chain deployment also requires consistency across repeated executions of the same operational environment\. Because LLM decisions are stochastic, the same state can induce different actions across runs; in a multi\-echelon network, those differences can propagate upstream and compound over time\. We refer to this amplification of run\-to\-run decision instability as the agent bullwhip effect\. It links sample\-path reliability to the structure of the supply chain: reliability risk is not only a property of an individual model response, but also an emergent property of delayed, decentralized coordination\. This framing motivates both our empirical reliability analysis and our use of post\-training to learn more stable supply\-chain policies\.
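The two dimensions of agent bullwhip can be measured directly from repeated runs by computing, for every facility and week, the dispersion of orders across runs. The sketch below assumes orders are stored as a nested list indexed by run, facility, and week, and it uses the population standard deviation as the dispersion measure; both are choices made for this example. The toy orders are invented to show the pattern and are not results from the paper.

```python
from statistics import pstdev

FACILITIES = ["retailer", "wholesaler", "distributor", "factory"]


def across_run_std(orders_by_run):
    """Return std[f][t]: dispersion of orders across runs for facility f, week t.

    orders_by_run[r][f][t] is the order placed in run r by facility f in week t.
    """
    n_facilities = len(orders_by_run[0])
    n_weeks = len(orders_by_run[0][0])
    return [
        [pstdev([run[f][t] for run in orders_by_run]) for t in range(n_weeks)]
        for f in range(n_facilities)
    ]


# Made-up orders for 3 runs x 4 facilities x 2 weeks.
toy_runs = [
    [[4, 4], [4, 5], [4, 6], [4, 8]],
    [[4, 4], [4, 4], [4, 4], [4, 4]],
    [[4, 5], [4, 6], [4, 8], [4, 12]],
]
dispersion = across_run_std(toy_runs)
for name, row in zip(FACILITIES, dispersion):
    print(name, [round(s, 2) for s in row])
```

Reading the output down a column compares facilities at a fixed week (the cross\-facility dimension); reading along a row tracks one facility over time (the intertemporal dimension).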
## 3 Inference\-time Methods: Lessons Learned Using GenAI as Autonomous Agents without Training
In this section, we evaluate various types of large language models \(LLMs\) and examine inference\-time methods that can improve the performance of LLM agents in supply\-chain decision\-making\. Here, inference\-time strategies reflect how most people interact with GenAI agents: by submitting natural\-language queries\. Because LLMs are trained on heterogeneous data sources, they can generate meaningful responses with little or no task\-specific customization\. This raises two central questions: Can these models be deployed “as is” to manage complex supply\-chain tasks effectively? If not, are there generalizable inference\-time strategies that firms can use to steer GenAI models toward better performance?
We first note the generational gap between earlier and more recent LLMs\. Recent models are increasingly equipped with reasoning capabilities that enable them to decompose complex decision problems into smaller, more tractable steps and use explicit intermediate reasoning to guide action selection\. This capability is particularly relevant in supply\-chain settings, where local ordering decisions interact dynamically with inventory positions, backlogs, delays, and upstream amplification\. We report the performance of various LLMs in the Beer Game and benchmark them against human teams\. Our results show that more advanced models with stronger reasoning capabilities substantially outperform earlier model generations\. At the same time, earlier models can perform effectively when supported by appropriate inference\-time interventions, particularly curated information sharing and coordination through a central orchestrator\.
Our experiments identify four inference\-time levers that are critical to the autonomous use of GenAI in supply chains: \(1\) model selection, \(2\) policies and guardrails, \(3\) data sharing through a centralized orchestrator, and \(4\) prompt engineering\. The most important lever is the selection of a capable model\. However, even the most advanced models require careful deployment and guidance\. Performance can be substantially improved by sharing the right data through a centralized orchestrator to enhance coordination and by using appropriately designed prompts\. Together, these four levers determine whether the integration of autonomous agents succeeds or fails\.
Using all four levers, we demonstrate that GenAI agents outperform human teams operating under identical conditions, providing evidence for their potential integration into real\-world supply\-chain operations\.
### 3\.1 Generational Leap of LLMs
The choice of LLM used for the agents is the single most important determinant of performance because an agent’s underlying reasoning capability directly affects supply chain costs and system stability\. Less advanced models can amplify system noise into costly bullwhip effects, whereas more advanced models can attenuate such amplification\. To evaluate reliability, we conducted multiple identical Beer Game runs for each model\. In our decentralized setup \(that is, a setting in which no information is shared across agents\), we found that many earlier\-generation models were highly inefficient, producing pronounced bullwhip effects and generating costs an order of magnitude higher than those of human teams\. These models were also unreliable: across identical runs, the variation in total cost ranged from 13% to 46% of the mean\.
More concerningly, some models failed to follow instructions, leading to systemic breakdowns\. In our trials, models such as Microsoft’s Phi\-4 and DeepSeek\-R1\-0528 violated basic ordering rules in more than 25% of cases\.
However, recent models with advanced reasoning capabilities demonstrate a clear improvement in performance\. For example, upgrading agents from GPT\-4o mini to GPT\-5 mini reduced total supply chain costs by 70%\. Similarly, the newer and more lightweight Llama 4 Maverick 17B model substantially outperformed its much larger predecessor, Llama 3\.3 70B, reducing costs by 82%, although its results remained unstable\. These findings indicate that firms should prioritize reasoning ability and instruction\-following when selecting a model; subsequent interventions should be viewed as performance optimizations\.
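The cross\-run variation reported in this subsection is expressed as a percentage of the mean. One plausible reading is the coefficient of variation (sample standard deviation divided by the mean) of total cost across identical runs; the text does not name the statistic, so the sketch below is an assumption, and the cost figures are invented.

```python
from statistics import mean, stdev


def coefficient_of_variation(total_costs):
    """Sample standard deviation of total cost divided by its mean."""
    if len(total_costs) < 2:
        raise ValueError("need at least two runs")
    mu = mean(total_costs)
    if mu == 0:
        raise ValueError("mean cost is zero; the ratio is undefined")
    return stdev(total_costs) / mu


# Total supply-chain cost from five identical runs of one hypothetical model.
run_totals = [900.0, 1400.0, 1000.0, 2100.0, 1100.0]
print(f"cross-run variation: {coefficient_of_variation(run_totals):.0%}")
```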
### 3\.2 Policies and Guardrails to Limit Costly Errors
Policies and guardrails are most valuable not because they make agents inherently better at solving the underlying decision problem, but because they prevent agents from taking high\-cost actions when their reasoning or forecasts fail\. Constraints on an agent’s range of possible actions can therefore materially improve both efficiency and reliability\. This is particularly important in supply\-chain settings, where simple guardrails can prevent panic\-induced over\-ordering that triggers costly bullwhip cascades across upstream suppliers\. In our experiments, a simple budget constraint proved especially effective\. Each agent was assigned a fixed budget, and orders were not allowed to exceed available funds\. This hard constraint operates as a brake on panic buying: when an agent experiences a stockout and attempts to place an excessively large order, the budget limit forces a more measured response, reducing demand amplification and limiting the propagation of shocks upstream\.
The effects were substantial: total costs decreased by 25% for GPT\-5 mini, 39% for GPT\-4o mini, and 41% for Llama 4 Maverick 17B\. For capable but less stable models such as Llama 4 Maverick 17B, the improvement in reliability was also pronounced, with cross\-run variation in performance declining from 46% to 37%\. These findings suggest that, once a sufficiently capable model has been selected, firms should adopt targeted operational policies to limit erratic behavior\. A hard budget constraint represents a particularly high\-leverage intervention, as it directly restricts the excessive ordering behavior that contributes to instability\.
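The hard budget constraint described in this subsection amounts to clamping each requested order to what the agent can afford. The sketch below is one minimal way to express it; the `unit_cost` parameter and the treatment of negative requests as zero are assumptions, since the text specifies only that orders may not exceed available funds.

```python
def apply_budget_guardrail(requested_order, available_funds, unit_cost):
    """Clamp an agent's requested order to the largest affordable quantity."""
    if unit_cost <= 0:
        raise ValueError("unit_cost must be positive")
    requested = max(0, int(requested_order))             # no negative orders
    affordable = max(0, int(available_funds // unit_cost))
    return min(requested, affordable)


# A panic order of 40 units is cut back to what the budget allows.
print(apply_budget_guardrail(requested_order=40, available_funds=100.0, unit_cost=5.0))
```

Because the clamp is applied outside the model, it holds even when the agent's reasoning or forecast fails, which is the role the guardrail plays in the experiments above.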
### 3\.3 Information Orchestration: Share Curated Data Through a Central Orchestrator
To evaluate how information sharing affects agent performance, we introduced a central “orchestrator”: an agent with full visibility across the supply chain and responsibility for sharing specific, curated information with the agents participating in the game\. This design reflects an important distinction between human and LLM\-based decision\-making\. Information that is useful to human teams may distract an AI agent, resulting in poorer decisions and higher costs\. Accordingly, we evaluated two information\-sharing strategies in which the orchestrator shared information but did not make decisions\. The results indicate that more data is not necessarily better\.
In the first scenario, the orchestrator shared only real\-time customer demand\. When agents received the current week’s customer demand, performance improved across all models\. Total costs decreased by approximately 18% for GPT\-5 mini, 25% for Llama 4 Maverick 17B, and 38% for GPT\-4o mini\.
In the second scenario, agents received both demand history and analysis\. Specifically, we provided a five\-week demand history and volatility analysis\. The results were mixed\. This richer information significantly improved performance for less advanced models, with costs for GPT\-4o mini decreasing by 69%\. However, for more advanced models, the additional information appeared to act as a distraction, and performance was worse than when agents received only real\-time demand\.
Notably, other data elements that typically benefit human decision makers—such as inventory position or pipeline inventory—provided limited benefits and often exacerbated the bullwhip effect\. These findings suggest that firms should be selective and empirically test which data are shared with AI agents\. For more advanced models, less information is often more effective\.
### 3.4 Prompt Design
Because LLMs are probabilistic systems, task framing matters\. Prompt design can substantially improve the performance of less advanced models, although it may provide limited benefits for more advanced models\. In our experiments, reframing the objective—the instruction provided to the LLM—from the general goal of “minimize total costs” to the more specific objective of “minimize the weighted average of backlog and holding costs” generated large gains for less advanced models\. This prompt revision reduced costs by 44% for GPT\-4o mini and 33% for GPT\-4\.1 mini\. For more advanced models, the effect was negligible\.
These results indicate that prompt design should be used as a secondary performance lever\. It can unlock meaningful improvements in less advanced models, but it is not a substitute for strong reasoning capabilities, robust guardrails, and curated data sharing\.
### 3.5 Summary: Reasoning Models and Critical Design Levers Achieve Above-Human Performance
Benchmarking against human performance is a critical test for autonomous systems\. Demonstrating competitiveness with trained professionals validates the potential for integrating AI agents into real\-world supply chain operations\. This competitive performance, together with AI agents’ inherent advantages in cost efficiency and continuous availability, provides a clear motivation for adoption\.
Our most significant finding is that GenAI agents powered by state\-of\-the\-art models can manage a supply chain at a level of proficiency that not only rivals, but can exceed, that of human experts\. We benchmarked multiple GenAI agent configurations against historical performance data from 12 Georgia Tech cohorts, comprising more than 100 students operating the same Beer Game\. The results were striking\. As illustrated in Figure[1](https://arxiv.org/html/2605.17036#S3.F1)\(Left\), earlier\-generation models exhibit suboptimal performance out of the box, roughly doubling the supply\-chain costs achieved by humans\. However, when provided with appropriate information through an orchestrator, they can outperform human teams by reducing costs by 32%\.
More advanced models with reasoning capabilities, by contrast, achieve competitive performance even when deployed out of the box\. As shown in Figure[1](https://arxiv.org/html/2605.17036#S3.F1)\(Left\), GPT\-5 mini achieved a 33% cost reduction relative to human teams when operated out of the box\. Most notably, when these models are optimized using additional strategic levers—specifically enhanced information sharing and policy constraints such as budget limitations—AI agents powered by GPT\-5 mini and Llama 4 Maverick 17B achieved 50% to 67% reductions in total supply\-chain costs relative to their human counterparts, as shown in Figure[1](https://arxiv.org/html/2605.17036#S3.F1)\(Right\)\.


Figure 1: Comparisons of AI setups against human teams in supply-chain cost performance. The out-of-the-box reasoning model (left, GPT-5 mini) exceeded human-level performance, while non-reasoning models (GPT-4o mini) required policy constraints, orchestration, and prompt engineering to close the gap with humans. Optimized with the same techniques, reasoning models (right, GPT-5 mini and Llama 4 Maverick 17B) achieved up to a 67% cost reduction relative to human teams in the MIT Beer Game.

This finding has important implications. It demonstrates that autonomous AI agents are already capable of handling the complex, dynamic decision-making required for core supply chain functions. By delegating such operational tasks to reliable GenAI agents, human managers can redirect their attention from day-to-day operational routines toward higher-value activities, such as strategic network design, supplier relationship management, navigating major disruptions, and breaking down the functional silos that currently separate supply chain, finance, sales, and trade. In this setting, the role of the supply chain professional evolves from operator to strategist.
Overall, these findings suggest that GenAI agents, when deployed with state\-of\-the\-art models and appropriate configuration, can achieve decision\-making quality comparable to that of human experts\. This represents a pivotal opportunity to accelerate the transition toward autonomous supply chains and to redeploy human expertise toward more strategic and creative supply chain challenges\.
Although the results in Figure [1](https://arxiv.org/html/2605.17036#S3.F1) are encouraging, they report costs averaged over repeated runs of the same simulation and therefore obscure a reliability question that is central to real-world deployment: do agents perform consistently well across runs? To examine this issue, Table [1](https://arxiv.org/html/2605.17036#S3.T1) reports detailed statistics across 30 runs of each AI setup, including mean total supply-chain cost, standard deviation, and coefficient of variation. The results reveal substantial variability. In particular, GPT-5 mini and Llama 4 Maverick 17B exhibit high coefficients of variation, ranging from 37% to 46%. This level of run-to-run instability poses a significant operational risk: even when an agent performs well on average, firms may still face occasional but costly failures that disrupt inventory planning, amplify upstream orders, and undermine trust in autonomous supply-chain management. We examine this reliability issue in the next section and later address it through reinforcement-learning post-training of the underlying LLM.
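The coefficient of variation used above is the standard deviation of total cost across runs divided by the mean. The following is a minimal sketch of that calculation, assuming the sample (n-1) standard deviation; the text does not state which estimator was used, and the cost values are hypothetical numbers chosen for illustration, not data from the experiments.

```python
import statistics


def coefficient_of_variation(costs):
    """Return the sample standard deviation of `costs` divided by their mean."""
    mean = statistics.fmean(costs)
    if mean == 0:
        raise ValueError("mean cost is zero; the coefficient of variation is undefined")
    return statistics.stdev(costs) / mean


# Hypothetical total costs from five repeated runs (illustrative only).
example_costs = [100.0, 120.0, 80.0, 110.0, 90.0]
cv = coefficient_of_variation(example_costs)
print(f"coefficient of variation: {cv:.1%}")
```

A setup with a low mean cost can still have a high coefficient of variation, which is why the two statistics are reported side by side.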
Table 1: Supply Chain Cost and Coefficient of Variation Across Runs by Model Type
## 4 Reliability Issues of Autonomous AI Agents: the Agent Bullwhip Effect
As demonstrated in the preceding sections, autonomous GenAI agents, when optimally configured, can achieve strong average performance in supply\-chain settings, potentially surpassing human teams\. However, mean performance alone is insufficient to justify operational deployment in real\-world applications\. Supply\-chain practitioners must evaluate not only expected cost outcomes but also system consistency, robustness, and exposure to tail risks\. An autonomous policy that yields low expected costs but occasionally produces highly volatile procurement, production, or replenishment decisions is practically unviable\.
This concern becomes particularly salient when the same autonomous supply\-chain system is run repeatedly under identical conditions\. Because LLMs are inherently probabilistic, even advanced models generate different decisions across identical runs\. In our experiments, we ran LLM\-powered GenAI agents on the same GenAI Beer Game environment across 30 repeated trials\. If the agents used a stable inventory policy, one would expect these repeated runs to produce similar ordering trajectories\. Instead, as shown in Table[1](https://arxiv.org/html/2605.17036#S3.T1), there is significant run\-to\-run variation in agents’ orders, even though the environment, prompts, and system structure are all held fixed\. This variability should not be dismissed as a mere technical artifact of language models; rather, it constitutes a critical operational vulnerability\.
The risk is especially pronounced in multi\-echelon networks, where localized ordering errors can propagate rapidly and compound over time, transforming isolated deviations into distorted demand signals for upstream agents\. Instability at a single node can therefore cascade through the network, generating costly tail outcomes for the supply chain as a whole\.
### 4.1 Agent Bullwhip Effect
As described in Section [2](https://arxiv.org/html/2605.17036#S2), the classical bullwhip effect refers to the amplification of order variability as one moves upstream in the supply chain. Let $q_{k,t}^{(r)}$ denote the order placed by facility $k$ in period $t$ during simulation run $r$. The classical bullwhip effect can be expressed as an increase in order variance across echelons:

$$\mathcal{B}_{k}^{(r)}=\frac{\operatorname{Var}_{t}\!\left(q_{k,t}^{(r)}\right)}{\operatorname{Var}_{t}\!\left(q_{k-1,t}^{(r)}\right)}>1.$$

In the Beer Game, a modest change in customer demand can therefore turn into much larger order swings at the wholesaler, distributor, and factory. Our findings suggest that when autonomous LLM agents make these decisions, there is a second layer of amplification. Not only do order levels amplify upstream, but the dispersion and tail risk of the decision itself also amplify across otherwise identical runs.
###### Definition 1 (Agent bullwhip).
Consider a discrete-time serial supply chain with $n$ echelons indexed by $k=1,\dots,n$, where tier $0$ represents external customer demand. Consider repeated runs $r=1,\ldots,R$ of the same supply-chain environment under an identical demand path, system configuration, and agent setup. Let $q_{k,t}^{(r)}$ denote the order placed by echelon $k$ in period $t$ during run $r$. Define the run-to-run variance of echelon $k$'s order in period $t$ as

$$\sigma_{k,t}^{2}=\operatorname{Var}_{r}\!\left(q_{k,t}^{(r)}\right).$$
We say that *agent bullwhip* occurs when decision instability, measured across repeated runs, is amplified by the supply-chain system. This amplification can manifest along two dimensions: across echelons within a fixed time period, and over time within a fixed echelon. First, run-to-run decision variance increases upstream at a fixed period $t$:

$$\sigma_{k,t}^{2}>\sigma_{k-1,t}^{2}.$$

Equivalently, for adjacent echelons, define

$$\Psi_{k}(t)=\frac{\operatorname{Var}_{r}\!\left(q_{k,t}^{(r)}\right)}{\operatorname{Var}_{r}\!\left(q_{k-1,t}^{(r)}\right)}.$$

A value of $\Psi_{k}(t)>1$ indicates that run-to-run decision variance is amplified as orders move from echelon $k-1$ to echelon $k$. Because fixed-demand experiments can have zero run-to-run variance at tier $0$, we avoid normalizing by $\operatorname{Var}_{r}(q_{0,t}^{(r)})$. Instead, when adjacent denominators are positive, cumulative adjacent-tier amplification can be summarized by

$$C_{j}(t)=\prod_{k=1}^{j}\Psi_{k}(t).$$
Second, *intertemporal agent bullwhip* occurs when run-to-run decision variance increases over time within the same echelon:

$$\sigma_{k,t+1}^{2}>\sigma_{k,t}^{2}.$$

Equivalently, define the within-echelon amplification ratio

$$\Phi_{k}(t)=\frac{\operatorname{Var}_{r}\!\left(q_{k,t+1}^{(r)}\right)}{\operatorname{Var}_{r}\!\left(q_{k,t}^{(r)}\right)},$$

with $\Phi_{k}(t)>1$ indicating that decision instability accumulates over time for facility $k$.
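The two amplification ratios in Definition 1 can be estimated directly from logged orders. The sketch below assumes the orders are stored in a NumPy array of shape (runs, echelons, periods) with echelons ordered from downstream to upstream; the array layout, the function names, and the use of the sample variance are assumptions made for illustration. Ratios with a zero denominator are returned as NaN, in line with the caveat about zero run-to-run variance.

```python
import numpy as np


def run_variance(orders):
    """Run-to-run variance sigma^2[k, t]; `orders` has shape (runs, echelons, periods)."""
    return np.var(orders, axis=0, ddof=1)


def echelon_ratio(sigma2):
    """Ratio of each echelon's variance to the adjacent downstream echelon's (Psi)."""
    num, den = sigma2[1:], sigma2[:-1]
    return np.divide(num, den, out=np.full_like(num, np.nan), where=den > 0)


def intertemporal_ratio(sigma2):
    """Ratio of each period's variance to the previous period's, per echelon (Phi)."""
    num, den = sigma2[:, 1:], sigma2[:, :-1]
    return np.divide(num, den, out=np.full_like(num, np.nan), where=den > 0)


# Tiny synthetic example: 2 runs, 2 echelons, 2 periods (illustrative values only).
example_orders = np.array([
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 4.0], [4.0, 8.0]],
])
sigma2 = run_variance(example_orders)
psi = echelon_ratio(sigma2)        # shape (echelons - 1, periods)
phi = intertemporal_ratio(sigma2)  # shape (echelons, periods - 1)
```

In this synthetic example every ratio equals 4, so the variance grows both upstream and over time.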
Figure 2: Agent bullwhip: order variability across agents and time. For each week and facility, the colored box captures the middle 50% of orders across repeated runs, the center line denotes the median, the whiskers show the non-outlier range beyond the interquartile range, and circles represent outlier orders. The amplification of decision unreliability across echelons manifests along two dimensions: decision variance increases across facilities at a fixed point in time and within each facility over time.
### 4.2 Agent Bullwhip in Action
To demonstrate the agent bullwhip effect in practice, we analyze order quantities across 30 runs of the GenAI Beer Game driven by Qwen\-3 4B under the same demand path, game structure, and prompt\. We plot the resulting order variability across facilities and weeks in Figure[2](https://arxiv.org/html/2605.17036#S4.F2)\. Because demand is identical across runs, the dispersion in the figure reflects conditional decision instability rather than demand variability\. Across all weeks, the retailer exhibits minor volatility in order quantities, but this variation becomes substantially larger for the wholesaler and most pronounced for the distributor and the factory\.
We can analyze the agent bullwhip along*two dimensions*, as run\-to\-run order variance compounds across agents and time\. First, holding a given week fixed, such as Week 15 in Figure[2](https://arxiv.org/html/2605.17036#S4.F2), order variability is barely visible for the retailer but increases as one moves upstream to the wholesaler, distributor, and factory\. Second, agent bullwhip accumulates intertemporally within a fixed facility\. For a given facility, such as the distributor shown in green, order variability increases over time as early decision shocks perturb inventory positions, backlogs, and pipeline inventories\. These perturbed states then feed into later ordering decisions, allowing small differences across runs to persist and compound through lead times and delayed feedback\. Thus, even under identical customer demand paths and system configurations, instability compounds both across agents and over time\.
This result runs counter to a common intuition about autonomous AI agents: agents do not become more reliable merely by interacting with the environment and making repeated decisions. The second dimension of agent bullwhip, order variability over time within a fixed facility, shows that decisions can become increasingly unstable across repeated interactions, even when the environment and agent configuration are held fixed and cost feedback is provided. This suggests that explicit intervention mechanisms are needed to support reliable autonomous supply-chain decision-making.
The traditional bullwhip effect captures amplification in realized order patterns\. It shows that, in a given game, upstream nodes may order more aggressively than downstream nodes in response to demand shocks\. However, it does not capture whether an agent’s decisions are stable across repeated exposure to the same environment\. Agent bullwhip captures this missing component\. The variance of decisions across repeated runs reflects uncertainty in the task as perceived and processed by the agent\. When that variance grows upstream and over time, it indicates that the autonomous system is not merely reacting strongly to demand; it is becoming less reliable in how it interprets and responds to the same operational state\.
This result raises an important caution: a supply chain governed by autonomous LLM agents may appear effective when judged by average cost alone, yet remain operationally fragile if it exhibits high decision variability\. In practice, reliability is a first\-order performance criterion\. Firms are unlikely to delegate planning decisions to an agent that occasionally makes unreasonable choices, even if its average performance is strong\.
Firms must be able to trust that a planning system will produce similar recommendations when faced with the same inputs, particularly in settings where supply\-chain decisions create downstream financial commitments and upstream production responses\. The presence of agent bullwhip suggests that reliability risk is endogenous to the multi\-agent supply\-chain structure: uncertainty is not only present at the individual\-agent level, but is also amplified by the network and by time delays\. This creates a new class of tail risk for autonomous operations and complicates the case for deploying off\-the\-shelf LLM agents without additional safeguards or specialized training\.
The central challenge, therefore, is to identify mechanisms that improve agent reliability\. We begin with a training\-free approach commonly used to reduce unreliability arising from model stochasticity: repeated sampling at test time\.
### 4.3 Test-Time Interventions Fail to Improve Reliability


Figure 3: Effect of repeated sampling on agent bullwhip. The top panel reports results in which each order decision is determined by majority vote over 10 independent samples, while the bottom panel uses 100 samples. Increasing test-time sampling does not reduce run-to-run variability, indicating that decision instability requires policy-level intervention, such as reinforcement-learning post-training of LLM agents.

A common approach in the computer science literature to address model unreliability at test time, without model post-training, is to introduce redundancy by drawing multiple outputs and aggregating them, often via majority voting (Wang et al., [2022](https://arxiv.org/html/2605.17036#bib.bib39)). If order instability were driven primarily by random decoding noise, such ensembling should induce policy convergence and stabilize performance. To test this hypothesis, we compare the default single-sample baseline (Figure [2](https://arxiv.org/html/2605.17036#S4.F2)) with majority-voting schemes based on 10 and 100 samples.
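The aggregation step in such a scheme can be sketched as follows. This is a minimal illustration, not the experimental code: the function name is hypothetical, the sampled order quantities are made up, and breaking ties in favor of the smallest quantity is an assumption, since the text does not specify a tie rule.

```python
from collections import Counter


def majority_vote(samples):
    """Return the most frequent order quantity; ties go to the smallest value (assumed rule)."""
    counts = Counter(samples)
    top = max(counts.values())
    return min(q for q, c in counts.items() if c == top)


# Hypothetical order quantities sampled from the same prompt and state.
sampled_orders = [8, 12, 8, 4, 8, 12, 8, 16, 8, 12]
decision = majority_vote(sampled_orders)
```

Voting returns the mode of the sampled decisions. It removes sampling noise around that mode, but it cannot correct a mode that itself reflects a poor ordering rule.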
Figure[3](https://arxiv.org/html/2605.17036#S4.F3)shows that repeated sampling fails to mitigate unreliability\. Substantial run\-to\-run variation persists, and tail events remain pronounced\. This reflects a deeper issue: instability arises from suboptimal decision policies rather than incidental randomness\. Off\-the\-shelf models lack a stable inventory\-control policy and exhibit systematic errors, such as overreacting to backlogs or neglecting pipeline inventory\. When the model is structurally uncertain, additional samples can simply reproduce the same deficient reasoning\.
More broadly, test\-time ensembling is insufficient for dynamic operational settings\. While redundancy can reduce incidental noise, supply\-chain management requires structured decision rules that account for delayed information and intertemporal trade\-offs\.
In the next section, we formalize this intuition by showing that repeated sampling cannot eliminate agent bullwhip in multi\-echelon systems\. Achieving reliability instead requires modifying the underlying policy\. We therefore turn to reinforcement\-learning post\-training to learn stable and coordinated inventory decisions\.
## 5 A Theoretical Model for Agent Bullwhip
In this section, we explain how the agent bullwhip effect arises and why repeated sampling fails to mitigate it\. We do so by decomposing order variability into two channels: demand\-driven amplification and decision\-driven amplification\. A central insight is that the agent bullwhip effect is not specific to any particular LLM, prompt, or decoding procedure; rather, it is an inherent risk in multi\-agent systems involving lead times, information delays, and decentralized coordination\. This perspective also clarifies why repeated sampling is insufficient\. Majority voting may reduce some model\-level randomness at the point of decision, but any residual variability that remains can still be amplified upstream by the multi\-agent structure\.
One might ask how this decomposition relates to the two empirical manifestations of agent bullwhip: increasing order variance across facilities at a fixed point in time and increasing variance within the same facility over time\. The decomposition explains both patterns\. At a fixed point in time, variability can grow across facilities because upstream tiers inherit both demand fluctuations and decision noise introduced by downstream agents\. Within a fixed facility, variability can grow over time because early decision shocks perturb inventories, backlogs, and pipeline positions, which then affect future orders through lead times and delayed feedback\. Thus, the observed increase in variance across both tiers and periods can be understood through the variance decomposition: it reflects a multi\-agent feedback system that amplifies both demand\-driven and decision\-driven uncertainty\.
We first distinguish two sources of randomness that contribute to the agent bullwhip effect. The first is the external demand path $D=\{D_{t}\}_{t\geq 0}$. The second is the decision-shock process $\epsilon=\{\epsilon_{k,t}\}$, which captures run-specific variation in the order-up-to target chosen by tier $k$. We consider a discrete-time serial supply chain with $n$ echelons indexed by $k=1,\dots,n$, where tier $0$ represents external customer demand and $q_{0,t}=D_{t}$. When conditioning on a realized demand path, we write $D=d$.
The distinction matters because the classical bullwhip effect concerns the propagation of demand uncertainty, whereas autonomous agents also introduce decision uncertainty even when the demand path and operational state are held fixed. In LLM decision-making, decision shock persists even if some ensembling layer is added (e.g., best-of-$n$ sampling). The law of total variance separates these two channels. For each tier and period,

$$\operatorname{Var}(q_{k,t})=\underbrace{\operatorname{Var}_{D}\!\left(\mathbb{E}_{\epsilon}[q_{k,t}\mid D]\right)}_{V^{D}_{k,t}}+\underbrace{\mathbb{E}_{D}\!\left[\operatorname{Var}_{\epsilon}(q_{k,t}\mid D)\right]}_{V^{\epsilon}_{k,t}}.\tag{1}$$

The first term, $V^{D}_{k,t}$, is the demand-driven component: it measures how external demand fluctuations propagate upstream after averaging over the agent's internal randomness. The second term, $V^{\epsilon}_{k,t}$, is the decision-driven component: it measures run-to-run instability generated by autonomous agents when the demand path is held fixed. We call upstream amplification of $V^{D}_{k,t}$ the *demand bullwhip*, and upstream amplification of $V^{\epsilon}_{k,t}$ the *decision bullwhip*.
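The decomposition in (1) can be illustrated numerically. The sketch below uses a toy order rule, a linear response to a scalar demand draw plus an additive decision shock, which is an assumption made purely for illustration and is not the model developed later in this section. Each row is one demand path and each column one repeated run on that path; with equal numbers of runs per path and population variances, the empirical decomposition holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_runs = 200, 300

# Toy setup (assumed): one scalar demand draw per path, one decision shock per run.
demand = rng.normal(0.0, 2.0, size=(n_paths, 1))
shock = rng.normal(0.0, 1.0, size=(n_paths, n_runs))
orders = 1.5 * demand + shock

# Demand-driven component: variance across paths of the per-path mean order.
v_demand = np.var(orders.mean(axis=1))
# Decision-driven component: average across paths of the within-path variance.
v_decision = np.mean(np.var(orders, axis=1))
# Total variance over all paths and runs.
total = np.var(orders)

print(f"total={total:.4f}  demand={v_demand:.4f}  decision={v_decision:.4f}")
```

Holding the demand path fixed, as in the repeated-run experiments, isolates the second component.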
The section proceeds in three steps\. First, we introduce a general operational model that includes on\-hand inventory, backlog, outstanding inventory, shipment constraints, nonnegative ordering, and decision shocks\. Second, we specialize this model to a linear benchmark system that admits exact transfer\-function analysis for both demand and decision shocks\. Third, within the linear benchmark system, we analyze agent bullwhip by decomposing order variability into demand\-driven and decision\-driven components\. The appendix then returns to the nonlinear operational setting through simulation\.
### 5.1 A General Operational Inventory Model
Each tier $k=1,\dots,n$ operates under an order-up-to policy with deterministic lead time $\ell_{k}\geq 0$. For each tier $k$, define the following state variables:
- $OH_{k,t}$: on-hand inventory at the beginning of period $t$,
- $B_{k,t}$: backlog owed by tier $k$ to tier $k-1$,
- $O_{k,t}$: outstanding inventory, i.e., orders already placed by tier $k$ but not yet received,
- $IP_{k,t}$: inventory position.

The inventory position is defined as

$$IP_{k,t}=OH_{k,t}+O_{k,t}-B_{k,t},\tag{2}$$

where inventory position counts physical inventory currently on hand, plus outstanding inventory, minus outstanding backlog.
##### Demand prediction and order-up-to level.
Each tier forms an exponentially smoothed forecast of downstream orders observed through period $t-1$, following the timing convention in (Chen et al., [2000b](https://arxiv.org/html/2605.17036#bib.bib1)):

$$\hat{q}_{k,t}=\lambda_{k}q_{k-1,t-1}+(1-\lambda_{k})\hat{q}_{k,t-1},\qquad\lambda_{k}\in(0,1].\tag{3}$$

The corresponding order-up-to target is

$$S_{k,t}=\theta_{k}\hat{q}_{k,t}+\epsilon_{k,t},\tag{4}$$

where $\epsilon_{k,t}$ denotes the tier's decision shock: a safety-stock perturbation, an idiosyncratic residual in the agent policy, or a run-specific interpretation of the same state. The multiplier $\theta_{k}$ controls the inventory target level. Under periodic review, the order-up-to level is typically set to cover expected demand over the protection interval, i.e., the replenishment lead time plus the review period. Hence, when the review period is one period and $\hat{q}_{k,t}$ is a one-period demand forecast, a common choice is $\theta_{k}=\ell_{k}+1$ (Silver et al., [1998](https://arxiv.org/html/2605.17036#bib.bib2); Chen et al., [2000b](https://arxiv.org/html/2605.17036#bib.bib1)).
Thus, the order placed by tier $k$ is given by:

$$q_{k,t}=\left[S_{k,t}-IP_{k,t}\right]^{+},\tag{5}$$

where $[x]^{+}:=\max\{x,0\}$. The positive-part operator captures the practical constraint that order quantities cannot be negative.
This timing convention means tier $k$ places its period-$t$ order using information observed through period $t-1$. The current downstream order $q_{k-1,t}$ then enters the inventory-position update after the period-$t$ order is placed. Under this convention, the exact transfer function contains a leading one-period delay. Because the lag operator has unit modulus in the frequency domain, removing that pure delay leaves variance gains unchanged and gives the classical exponential-smoothing bullwhip filter used for variance-gain calculations.
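The forecasting and ordering rules in (3) to (5) translate into two short functions. This is a minimal sketch for a single tier and a single period; the function names and the numerical values are hypothetical, with $\theta_k = 3$ corresponding to the common choice $\ell_k + 1$ for a lead time of two periods.

```python
def smoothed_forecast(prev_forecast, last_downstream_order, lam):
    """Exponentially smoothed forecast, eq. (3): uses the order observed in period t-1."""
    return lam * last_downstream_order + (1.0 - lam) * prev_forecast


def order_quantity(forecast, inventory_position, theta, shock=0.0):
    """Order-up-to rule, eqs. (4)-(5): target minus inventory position, floored at zero."""
    target = theta * forecast + shock
    return max(target - inventory_position, 0.0)


# Illustrative values (assumed): previous forecast 4, last downstream order 8.
forecast = smoothed_forecast(prev_forecast=4.0, last_downstream_order=8.0, lam=0.5)
order_low_ip = order_quantity(forecast, inventory_position=12.0, theta=3.0)
order_high_ip = order_quantity(forecast, inventory_position=20.0, theta=3.0)
```

With these values the forecast is 6 and the target is 18, so the tier orders 6 units when its inventory position is 12 and nothing when its inventory position is 20. The `shock` argument is where run-specific decision noise enters the order.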
##### System dynamics\.
Let $r_{k,t}$ denote inbound receipts to tier $k$, and let $s_{k,t}$ denote shipments from tier $k$ to tier $k-1$. For $k<n$, receipts at tier $k$ are delayed shipments from tier $k+1$: $r_{k,t}=s_{k+1,t-\ell_{k}}$. For the most upstream tier, we assume access to an outside supplier with unlimited capacity: $r_{n,t}=q_{n,t-\ell_{n}}$.
The effective demand faced by tier $k$ is the sum of current downstream orders and existing backlog:

$$\Delta_{k,t}=q_{k-1,t}+B_{k,t}.\tag{6}$$

Available inventory at tier $k$ is

$$A_{k,t}=OH_{k,t}+r_{k,t}.\tag{7}$$

Actual shipments are constrained by available inventory:

$$s_{k,t}=\min\left\{A_{k,t},\Delta_{k,t}\right\}.\tag{8}$$

The operational state dynamics are therefore

$$OH_{k,t+1}=OH_{k,t}+r_{k,t}-s_{k,t},\tag{9}$$

$$B_{k,t+1}=B_{k,t}+q_{k-1,t}-s_{k,t},\tag{10}$$

$$O_{k,t+1}=O_{k,t}+q_{k,t}-r_{k,t}.\tag{11}$$
The following proposition shows that the detailed operational state equations imply a simple inventory\-position balance\.
###### Proposition 1 (Inventory-position recursion).
Under the operational dynamics above, the inventory position of tier $k$ satisfies

$$IP_{k,t+1}=IP_{k,t}+q_{k,t}-q_{k-1,t}.\tag{12}$$
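The recursion follows from adding (9) and (11) and subtracting (10): the receipt and shipment terms cancel, leaving only the tier's own order and the downstream order. The sketch below checks this numerically for a single tier driven by arbitrary integer inputs. The initial state and the random inputs are assumptions chosen for illustration; receipts are drawn at random rather than generated by an upstream tier, which is sufficient because the cancellation does not depend on how receipts arise.

```python
import random


def step(state, downstream_order, own_order, receipts):
    """Advance (on_hand, backlog, outstanding) by one period using eqs. (6)-(11)."""
    on_hand, backlog, outstanding = state
    available = on_hand + receipts                  # eq. (7)
    effective_demand = downstream_order + backlog   # eq. (6)
    shipped = min(available, effective_demand)      # eq. (8)
    return (
        on_hand + receipts - shipped,               # eq. (9)
        backlog + downstream_order - shipped,       # eq. (10)
        outstanding + own_order - receipts,         # eq. (11)
    )


def inventory_position(state):
    """Inventory position, eq. (2)."""
    on_hand, backlog, outstanding = state
    return on_hand + outstanding - backlog


random.seed(0)
state = (12, 0, 8)  # assumed initial on-hand, backlog, outstanding
max_gap = 0
for _ in range(50):
    d = random.randint(0, 10)          # downstream order
    q = random.randint(0, 10)          # this tier's order
    r = random.randint(0, state[2])    # receipts cannot exceed outstanding orders
    new_state = step(state, d, q, r)
    # Gap between the simulated inventory position and the recursion in eq. (12).
    gap = inventory_position(new_state) - (inventory_position(state) + q - d)
    max_gap = max(max_gap, abs(gap))
    state = new_state
```

Because all quantities are integers, the gap is exactly zero in every period.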
The nonlinearity in \([5](https://arxiv.org/html/2605.17036#S5.E5)\) makes the full model analytically difficult\. We therefore next study a linear benchmark model that preserves the central feedback mechanism while allowing closed\-form characterization\.
### 5.2 Linear Benchmark Model
We now introduce a tractable benchmark system\.
###### Assumption 1 (Linear benchmark system).
For the analytical benchmark, assume:
1. orders are not truncated at zero;
2. lead times are deterministic.

Under Assumption [1](https://arxiv.org/html/2605.17036#Thmassumption1), the ordering rule becomes linear:

$$q_{k,t}=\theta_{k}\hat{q}_{k,t}+\epsilon_{k,t}-IP_{k,t},\tag{13}$$

where the forecast remains as

$$\hat{q}_{k,t}=\lambda_{k}q_{k-1,t-1}+(1-\lambda_{k})\hat{q}_{k,t-1}.\tag{14}$$

Let the customer demand $\{D_{t}\}_{t\geq 0}$ be i.i.d. with mean zero and variance $\sigma_{D}^{2}$, and suppose all initial conditions are deterministic. Since our focus is variance amplification, centering the demand process entails no loss of generality.
The reduced linear benchmark is therefore

$$q_{k,t}=\theta_{k}\hat{q}_{k,t}+\epsilon_{k,t}-IP_{k,t},\tag{15}$$

$$IP_{k,t+1}=IP_{k,t}+q_{k,t}-q_{k-1,t},\tag{16}$$

$$\hat{q}_{k,t}=\lambda_{k}q_{k-1,t-1}+(1-\lambda_{k})\hat{q}_{k,t-1}.\tag{17}$$
##### The lag operator\.
We work with discrete time series indexed by $t\in\mathbb{Z}$, and use the *lag operator* $\mathcal{L}$ to express linear dynamics compactly. For any time series $x_{t}$, the lag operator shifts the index back by one period,

$$\mathcal{L}\,x_{t}=x_{t-1}.$$

Powers of $\mathcal{L}$ iterate this shift, $\mathcal{L}^{j}x_{t}=x_{t-j}$, and a polynomial (or rational function) of $\mathcal{L}$ acts on $x_{t}$ in the obvious way; for example, $(1-\mathcal{L})x_{t}=x_{t}-x_{t-1}$. Throughout the analysis, equalities involving $\mathcal{L}$ are understood to hold for all $t$.
##### The order transfer function\.
We now derive the transfer representation that maps downstream orders and local decision shocks into the tier-$k$ order.
###### Proposition 2 (One-tier transfer function with decision shocks).
Under the linear benchmark system, the order process at tier $k$ satisfies

$$q_{k,t+1}=(1+\theta_{k}\lambda_{k})q_{k-1,t}-(\theta_{k}\lambda_{k}+1-\lambda_{k})q_{k-1,t-1}+(1-\lambda_{k})q_{k,t}+\epsilon_{k,t+1}-(2-\lambda_{k})\epsilon_{k,t}+(1-\lambda_{k})\epsilon_{k,t-1}.\tag{18}$$

Equivalently, in lag-operator form,

$$\boldsymbol{q}_{k}=H_{k}(\mathcal{L})\boldsymbol{q}_{k-1}+G(\mathcal{L})\boldsymbol{\epsilon}_{k},\tag{19}$$

where $\boldsymbol{q}_{k}=\{q_{k,t}\}_{t\geq 0}$, $\boldsymbol{q}_{k-1}=\{q_{k-1,t}\}_{t\geq 0}$, $\boldsymbol{\epsilon}_{k}=\{\epsilon_{k,t}\}_{t\geq 0}$, $\mathcal{L}x_{t}=x_{t-1}$, and

$$H_{k}(\mathcal{L})=\frac{(1+\theta_{k}\lambda_{k})\mathcal{L}-(\theta_{k}\lambda_{k}+1-\lambda_{k})\mathcal{L}^{2}}{1-(1-\lambda_{k})\mathcal{L}}=\mathcal{L}\left[1+\frac{\theta_{k}\lambda_{k}(1-\mathcal{L})}{1-(1-\lambda_{k})\mathcal{L}}\right].\tag{20}$$

The local decision-shock filter is

$$G(\mathcal{L})=1-\mathcal{L}.\tag{21}$$
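The recursion in (18) can be checked against a direct simulation of the reduced linear benchmark (15) to (17). The sketch below does this for a single tier. The parameter values, the Gaussian inputs, and the zero initial forecast and inventory position are assumptions chosen for illustration; the check starts at $t=1$ because (18) involves the inputs at $t-1$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, theta, lam = 60, 3.0, 0.4      # assumed horizon and parameters
d = rng.normal(size=T)            # downstream orders q_{k-1,t}
eps = rng.normal(size=T)          # decision shocks eps_{k,t}

q = np.zeros(T)
ip = 0.0                          # inventory position IP_{k,0} (assumed)
fc = 0.0                          # initial forecast \hat q_{k,0} (assumed)
for t in range(T):
    if t > 0:
        fc = lam * d[t - 1] + (1 - lam) * fc   # eq. (17)
    q[t] = theta * fc + eps[t] - ip            # eq. (15)
    ip = ip + q[t] - d[t]                      # eq. (16)

# Right-hand side of eq. (18) for t = 1, ..., T-2.
t = np.arange(1, T - 1)
rhs = (
    (1 + theta * lam) * d[t]
    - (theta * lam + 1 - lam) * d[t - 1]
    + (1 - lam) * q[t]
    + eps[t + 1]
    - (2 - lam) * eps[t]
    + (1 - lam) * eps[t - 1]
)
max_err = np.max(np.abs(q[t + 1] - rhs))
print(f"largest deviation from eq. (18): {max_err:.2e}")
```

The deviation is at the level of floating-point rounding, which is consistent with (18) being an exact consequence of (15) to (17) rather than an approximation.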
### 5.3 Agent Bullwhip and Decomposition
We now characterize how the order variance changes as orders propagate upstream. In the stationary benchmark, we drop the time subscript from the variance components and write

$$V^{D}_{k}:=\operatorname{Var}_{D}\!\left(\mathbb{E}_{\epsilon}[q_{k,t}\mid D]\right),\qquad V^{\epsilon}_{k}:=\mathbb{E}_{D}\!\left[\operatorname{Var}_{\epsilon}(q_{k,t}\mid D)\right].$$

To make the two components explicit, define the demand-driven conditional mean

$$\bar{q}_{k,t}(D):=\mathbb{E}_{\epsilon}[q_{k,t}\mid D],\qquad\bar{\boldsymbol{q}}_{k}=\{\bar{q}_{k,t}(D)\}_{t\geq 0},$$

and the centered decision-driven deviation

$$x_{k,t}(D):=q_{k,t}-\bar{q}_{k,t}(D),\qquad\boldsymbol{x}_{k}=\{x_{k,t}(D)\}_{t\geq 0}.$$

Thus $V^{D}_{k}=\operatorname{Var}_{D}(\bar{q}_{k,t}(D))$, while $V^{\epsilon}_{k}=\mathbb{E}_{D}[\operatorname{Var}_{\epsilon}(x_{k,t}(D)\mid D)]$.
For ease of explanation, we impose the following assumption:
###### Assumption 2 (Heterogeneous lower bounds and independent inputs).
For each tier $k$, the order-up-to multiplier and smoothing parameter satisfy
$$\theta_{k}\geq\theta>0,\qquad\lambda_{k}\in[\lambda,1],$$
where $\lambda>0$. The demand process is centered and independent across time:
$$\mathbb{E}[D_{t}]=0,\qquad D_{t}\perp D_{s}\quad\text{for }t\neq s,$$
with
$$\operatorname{Var}(D_{t})\geq\sigma_{D}^{2}>0\qquad\text{for all }t.$$
The decision shocks are centered, independent across tiers and time, and independent of demand:
$$\mathbb{E}[\epsilon_{k,t}]=0,\qquad\epsilon_{k,t}\perp\epsilon_{j,s}\quad\text{unless }(k,t)=(j,s),$$
and
$$\operatorname{Var}(\epsilon_{k,t})\geq\sigma_{k}^{2}>0\qquad\text{for all }k,t.$$
#### 5.3.1 Demand Bullwhip
Taking the conditional expectation of ([19](https://arxiv.org/html/2605.17036#S5.E19)) over $\epsilon$ gives the demand-channel recursion
$$\bar{\bm{q}}_{k}=H_{k}(\mathcal{L})\,\bar{\bm{q}}_{k-1},\qquad\bar{\bm{q}}_{0}=\bm{D}. \tag{22}$$
Therefore $\bar{\bm{q}}_{k}=\left(\prod_{r=1}^{k}H_{r}(\mathcal{L})\right)\bm{D}$, with the product ordered from downstream to upstream.
###### Theorem 1 (Demand bullwhip).
Suppose Assumption [2](https://arxiv.org/html/2605.17036#Thmassumption2) holds. In the linear benchmark, the demand-driven component satisfies
$$V^{D}_{k}\geq\sigma_{D}^{2}\,\Gamma^{k},$$
where
$$\Gamma=1+2\theta\lambda+\frac{2\theta^{2}\lambda^{2}}{2-\lambda}>1.$$
When $\theta_{k}=\theta$, $\lambda_{k}=\lambda$, and demands are i.i.d. with variance $\sigma_{D}^{2}$, Theorem [1](https://arxiv.org/html/2605.17036#Thmtheorem1) gives the common-parameter white-noise benchmark. The theorem shows the demand bullwhip: uncertainty in customer demand is amplified as it moves upstream through the replenishment rule, and the demand-driven component grows at least exponentially in the tier index. This bullwhip effect is consistent with those documented in the literature (Chen et al., [2000a](https://arxiv.org/html/2605.17036#bib.bib41), [b](https://arxiv.org/html/2605.17036#bib.bib1)).
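One way to read the gain $\Gamma$ in the common-parameter white-noise benchmark is as the sum of squared impulse-response coefficients of the filter $H_k$ in (20). This is our own reading, obtained by expanding (20) as a power series in $\mathcal{L}$, and is not a restatement of the paper's proof. The sketch below checks the identity numerically; the parameter values and function names are illustrative assumptions.

```python
import numpy as np

def gamma(theta, lam):
    """Per-tier gain Gamma from Theorem 1."""
    return 1 + 2 * theta * lam + 2 * theta**2 * lam**2 / (2 - lam)

def h_energy(theta, lam, n=5000):
    """Sum of squared impulse-response coefficients of H_k, truncated at n terms."""
    h = np.empty(n)
    h[0] = 1 + theta * lam                                     # coefficient on lag 1
    h[1:] = -theta * lam**2 * (1 - lam) ** np.arange(n - 1)    # lags 2, 3, ...
    return float(np.sum(h**2))

theta, lam, sigma_d2 = 2.0, 0.3, 1.0   # illustrative values, not from the paper
print(gamma(theta, lam), h_energy(theta, lam))
# Lower bound sigma_D^2 * Gamma^k for tiers k = 1..4
print([sigma_d2 * gamma(theta, lam) ** k for k in range(1, 5)])
```

With white-noise input of variance $\sigma_D^2$, passing through one tier multiplies the variance by this energy, which is why the bound compounds as $\Gamma^k$ across $k$ tiers.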
#### 5.3.2 Decision Bullwhip
We next isolate the decision-bullwhip component. Conditioning on $D=d$ removes demand randomness, so the only remaining variation is the agent's run-to-run decision uncertainty. The customer tier has no decision shock, so $V^{\epsilon}_{0}=0$.
Subtracting ([22](https://arxiv.org/html/2605.17036#S5.E22)) from ([19](https://arxiv.org/html/2605.17036#S5.E19)) gives the decision-channel recursion
$$\bm{x}_{k}=H_{k}(\mathcal{L})\,\bm{x}_{k-1}+G(\mathcal{L})\,\bm{\epsilon}_{k},\qquad\bm{x}_{0}=0. \tag{23}$$
Thus the decision component is built up from local shocks injected through $G$ and then propagated upstream by the tier-specific filters $H_{k}$.
###### Theorem 2 (Decision bullwhip).
Suppose Assumption [2](https://arxiv.org/html/2605.17036#Thmassumption2) holds. In the linear benchmark model, the decision-driven component satisfies
$$V^{\epsilon}_{k}\geq 2\sum_{j=1}^{k}\sigma_{j}^{2}\,\Gamma^{k-j}, \tag{24}$$
where
$$\Gamma=1+2\theta\lambda+\frac{2\theta^{2}\lambda^{2}}{2-\lambda}>1.$$
Theorem [2](https://arxiv.org/html/2605.17036#Thmtheorem2) characterizes the decision bullwhip: even when the demand path is fixed, local agent-level decision noise accumulates across tiers and is amplified by the same upstream feedback loop that drives the classical demand bullwhip. In the common-parameter benchmark with comparable decision-shock variances, the lower bound becomes a geometric accumulation term; for example, if $\sigma_{j}^{2}=\sigma_{\epsilon}^{2}$ for all $j$, then
$$V^{\epsilon}_{k}\geq 2\sigma_{\epsilon}^{2}\sum_{m=0}^{k-1}\Gamma^{m}.$$
###### Corollary 1 (Exponential growth from any downstream decision noise).
Suppose Assumption [2](https://arxiv.org/html/2605.17036#Thmassumption2) holds. In the linear benchmark model, if there exists a fixed tier $j_{0}$ such that
$$\sigma_{j_{0}}^{2}>0,$$
then, for all $k\geq j_{0}$,
$$V^{\epsilon}_{k}\geq 2\sigma_{j_{0}}^{2}\,\Gamma^{k-j_{0}}.$$
Corollary [1](https://arxiv.org/html/2605.17036#Thmcorollary1) suggests that any nonzero decision noise source at a fixed downstream tier generates exponential variance growth as it propagates upstream.
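The lower bounds in Theorem 2 and Corollary 1 are straightforward to evaluate. The following sketch computes the right-hand side of (24) and the single-source bound of the corollary; the function names and the numerical values are illustrative assumptions, and `theta` and `lam` stand for the lower bounds $\theta$ and $\lambda$ of Assumption 2.

```python
def gamma(theta, lam):
    """Per-tier gain Gamma shared by Theorems 1 and 2."""
    return 1 + 2 * theta * lam + 2 * theta**2 * lam**2 / (2 - lam)

def decision_lower_bound(sigma2, theta, lam):
    """Right-hand side of Eq. (24) at tier k = len(sigma2).

    sigma2[j-1] is the decision-shock variance bound sigma_j^2 of tier j.
    """
    g = gamma(theta, lam)
    k = len(sigma2)
    return 2.0 * sum(s * g ** (k - j) for j, s in enumerate(sigma2, start=1))

def single_source_bound(sigma2_j0, j0, k, theta, lam):
    """Corollary 1: contribution of tier j0 alone, observed at tier k >= j0."""
    return 2.0 * sigma2_j0 * gamma(theta, lam) ** (k - j0)

# Four tiers with equal shock variances (illustrative values).
print(decision_lower_bound([1.0] * 4, theta=2.0, lam=0.3))
print(single_source_bound(1.0, j0=1, k=4, theta=2.0, lam=0.3))
```

The leading factor 2 reflects the shock filter $G(\mathcal{L})=1-\mathcal{L}$, whose two coefficients $(1,-1)$ have squared sum 2.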
#### 5.3.3 Discussion
##### A two\-facet perspective\.
Theorem [1](https://arxiv.org/html/2605.17036#Thmtheorem1) and Theorem [2](https://arxiv.org/html/2605.17036#Thmtheorem2) make the same structural point from two complementary directions. Demand uncertainty and decision uncertainty are distinct components of order variability, but both are transmitted through the same upstream replenishment dynamics. In the heterogeneous-parameter setting, this transmission is governed by the tier-specific filters $H_{k}$. Thus the classical demand bullwhip and the agent-driven decision bullwhip are not separate mechanisms; they are two inputs to the same delayed feedback system.
This distinction is important for intervention design. Classical bullwhip mitigation mechanisms, such as demand smoothing, order coordination, and information sharing, primarily target the demand-driven component $V^{D}_{k}$. They can reduce the variability inherited from external demand, but they do not eliminate the decision-driven component $V^{\epsilon}_{k}$, which is generated by the agent's own policy even after conditioning on a fixed demand path. In the context of LLM agents, this means that two runs with the same demand realization and the same operational state may still produce different order-up-to targets, and these residual differences are then propagated upstream by the supply-chain feedback loop.
##### Accumulation of decision noise\.
Our results also imply that decision bullwhip can dominate in regimes where agents are tuned for prediction stability. When the smoothing parameter $\lambda$ is small, the one-tier gain satisfies
$$\Gamma=1+2\theta\lambda+O(\lambda^{2}),$$
so demand-driven amplification can be relatively mild over a moderate number of tiers. By contrast, the decision-driven component still accumulates across tiers. If the decision variances are of comparable magnitude, for example $\sigma_{j}^{2}\asymp\sigma_{\epsilon}^{2}$, then
$$V^{\epsilon}_{k}\;\gtrsim\;2\sigma_{\epsilon}^{2}\sum_{m=0}^{k-1}\Gamma^{m}.$$
In the small-$\lambda$ regime, this behaves approximately like $2\sigma_{\epsilon}^{2}k$ over a moderate number of tiers. Thus even when the propagation gain is close to one, residual agent-level randomness can accumulate across the supply chain and become a major source of order variability.
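The linear-in-$k$ approximation can be made explicit with a short expansion. This is a sketch under the common-parameter assumptions; the shorthand $\delta$ is introduced here for illustration and is not the paper's notation. Writing $\delta:=\Gamma-1=2\theta\lambda+O(\lambda^{2})$, the geometric sum is

$$\sum_{m=0}^{k-1}\Gamma^{m}=\frac{(1+\delta)^{k}-1}{\delta}=k+\binom{k}{2}\delta+\cdots=k\bigl[1+(k-1)\theta\lambda+\cdots\bigr],$$

so the bound reduces to roughly $2\sigma_{\epsilon}^{2}k$ whenever $k\theta\lambda\ll 1$. Once $k\theta\lambda$ is no longer small, the higher-order terms take over and the accumulation becomes geometric rather than linear.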
These results explain why repeated sampling is an incomplete remedy. Majority voting or best-of-$n$ sampling may reduce the local decision variance $\sigma_{j}^{2}$, but unless it drives this variance very close to zero, the remaining decision noise continues to enter the feedback system and propagate upstream. The structural source of the agent bullwhip is therefore not merely stochasticity at a single decision point; it is the *interaction* between residual decision variability, lead times, information delays, and decentralized replenishment decisions inherent in autonomous supply chains.
##### Fixed\-tier accumulation over time\.
Our results describe how decision variability grows across tiers\. A complementary time\-domain implication is that, for a fixed facility, decision unreliability also accumulates over time in the finite\-horizon linear benchmark\.
###### Proposition 3 (Intertemporal accumulation of decision unreliability).
Fix a demand path $D=d$ and consider the finite-horizon linear benchmark initialized from deterministic initial conditions, with zero shock pre-history. Assume the decision shocks are centered, independent across tiers and time, and satisfy
$$\operatorname{Var}(\epsilon_{j,t})=\sigma_{j}^{2}<\infty.$$
For a fixed facility $k$, define
$$W_{k,t}(d):=\operatorname{Var}(q_{k,t}\mid D=d).$$
Then
$$W_{k,t+1}(d)\geq W_{k,t}(d)\qquad\text{for all }t.$$
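One way to see why the conditional variance cannot decrease is to unroll recursion (23): with zero shock pre-history, the deviation at week $t$ is a weighted sum of the independent shocks from weeks $0,\ldots,t$, so its variance is a partial sum of squared filter coefficients. The sketch below illustrates this reading numerically. It is our own illustration rather than the paper's proof; it assumes shocks start at week 0, and the function names and parameter values are hypothetical.

```python
import numpy as np

def h_coeffs(theta, lam, n):
    """First n impulse-response coefficients of H in Eq. (20)."""
    h = np.zeros(n)
    if n > 1:
        h[1] = 1 + theta * lam
    if n > 2:
        h[2:] = -theta * lam**2 * (1 - lam) ** np.arange(n - 2)
    return h

def conditional_variance_path(sigma2, thetas, lams, horizon):
    """W_{k,t} for t = 0..horizon-1 at the top tier k = len(sigma2)."""
    k = len(sigma2)
    g = np.zeros(horizon)
    g[0] = 1.0
    if horizon > 1:
        g[1] = -1.0                      # G(L) = 1 - L
    W = np.zeros(horizon)
    for j in range(1, k + 1):
        psi = g.copy()                   # response of tier j to its own shock
        for r in range(j + 1, k + 1):    # propagate through H_{j+1}, ..., H_k
            h = h_coeffs(thetas[r - 1], lams[r - 1], horizon)
            psi = np.convolve(psi, h)[:horizon]
        # Only shocks from weeks 0..t are present, hence the running sum.
        W += sigma2[j - 1] * np.cumsum(psi**2)
    return W

W = conditional_variance_path([1.0, 1.0, 1.0], [2.0] * 3, [0.3] * 3, horizon=20)
print(np.round(W[:6], 3))
```

Because each week adds a nonnegative squared coefficient, the resulting path is non-decreasing in $t$, which matches the statement of the proposition in this linear setting.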
##### Summary\.
In practical deployments of LLM agents for multi\-echelon supply\-chain management, the observed agent bullwhip effect should therefore be understood as the combined outcome of demand\-driven and decision\-driven amplification\. The demand channel captures how external uncertainty propagates through the system, while the decision channel captures how run\-to\-run instability in agent policies is generated and amplified even under a fixed demand path\. Addressing the latter requires changing the decision policy itself, rather than merely averaging over its outputs\. This motivates our subsequent approach: training supply\-chain\-specialized agents using Group Relative Policy Optimization \(GRPO\), which directly targets decision behavior by learning more stable and coordinated inventory\-management policies\.
## 6 Training Supply-Chain-Specialized Agents with Group Relative Policy Optimization (GRPO)
The theoretical results above indicate that agent bullwhip is not an incidental artifact of a particular model or decoding procedure, but an inherent risk in multi\-agent systems with lead times and decentralized coordination\. This interpretation is reinforced by the failure of repeated sampling: if instability were driven primarily by random decoding noise, then aggregating multiple samples should have produced a more stable policy\. Instead, its limited effect suggests that the problem lies deeper, in the absence of a learned decision policy that can reliably coordinate across echelons and optimize system\-level outcomes\.
This pattern is consistent with the nature of the task\. Off\-the\-shelf, general\-purpose LLMs exhibit broad linguistic competence and strong generalized reasoning, but they are not explicitly trained to internalize the dynamics of inventory replenishment, including lead times, delayed feedback loops, multi\-echelon coordination, and system\-wide cost trade\-offs\. This challenge is analogous to robotics, where models often require task\-specific training before they can operate reliably in unfamiliar environments\. In the supply\-chain setting, the absence of such specialization leads to erratic heuristic responses, substantial run\-to\-run variability, and severe tail risks\.
The natural way to turn a general\-purpose LLM into an inventory\-management agent is reinforcement\-learning post\-training\. In this framing, the LLM is a stochastic policy whose input is the supply\-chain state observed by a player and whose output is that player’s ordering decision\. Post\-training refines the LLM’s parameters so that an agent using the LLM produces effective decisions on the trajectories the policy itself induces\.
This motivates the central question: can reinforcement\-learning post\-training transform a general\-purpose LLM from a capable but volatile decision\-maker into a reliable inventory\-management agent? More specifically, can feedback from realized supply\-chain performance induce a replenishment policy that is both cost\-effective and robust across runs?
Applying standard reinforcement\-learning algorithms to the Beer Game is difficult\. Most modern policy\-gradient methods, such as actor\-critic algorithms, rely on a learned value function that maps states to expected discounted returns, and several features of this environment make such a value function unreliable: each agent’s state includes inventory, backlog, pipeline, and recent orders; each agent sees only its local state and the orders and shipments from its immediate neighbors; rewards are delayed by lead times; and the relevant horizon spans many weeks\.
We therefore use Group Relative Policy Optimization \(GRPO\), which avoids learning an explicit value function\. At each training step, GRPO samples a group of trajectories under the current policy and computes a baseline from the realized costs of the group\. Trajectories that outperform the baseline are reinforced; trajectories that underperform are discouraged\. This relative comparison provides a stable learning signal in a high\-dimensional, multi\-agent, delayed\-feedback environment where value estimation would be unreliable\.
Our GRPO\-based framework trains a single shared LLM backbone across the four tiers using system\-level rewards, so that the shared model learns how local ordering decisions interact across echelons\. After post\-training, we deploy independent instances of this shared backbone at each tier, with each agent acting on its own local state but using the shared learned policy\. The trained LLM therefore carries a global view of how every tier should behave under system\-level cost, even though each deployed instance acts only on local information\. This is the centralized\-training, decentralized\-execution paradigm from multi\-agent reinforcement learning: coordination is built into the policy through training rather than into runtime communication or orchestration\.
##### Environment and setup\.
We consider the GenAI Beer Game described in Section [2](https://arxiv.org/html/2605.17036#S2), a four-echelon supply chain consisting of a retailer, wholesaler, distributor, and factory. All four agents share the same LLM policy $\pi_{\theta}$ but operate at different positions in the supply chain. The system evolves over $T$ weeks. Every week, every LLM-powered agent observes its local state (inventory, backlog, incoming shipments, etc.) and outputs an order quantity.
Training is performed over multiple simulated episodes. Each episode corresponds to one full beer game trajectory of length $T$. For each training step, we run $G$ independent episodes (rollouts) using the current policy.
##### Demand curriculum\.
To expose the model to diverse dynamics, we train under synthetic demand distributions:
- Curriculum 1 (Poisson): Demand is drawn i.i.d. from $\mathrm{Poisson}(\lambda)$, where $\lambda\sim\mathcal{U}(5,20)$ is resampled per episode.
- Curriculum 2 (Truncated Normal): Demand is drawn i.i.d. from a truncated normal distribution: $D_{t}\sim\mathrm{TruncNormal}(\mu,\sigma^{2};[0,50])$, with $\mu\sim\mathcal{U}(8,20)$ and $\sigma\sim\mathcal{U}(2,6)$.
In practice, training may use either distribution or a curriculum that switches between them across training steps\.
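The two demand curricula can be sketched as simple per-episode samplers. This is a minimal illustration, not the authors' training code: the function names are hypothetical, the truncated normal is drawn by rejection sampling, resampling $\mu$ and $\sigma$ once per episode is an assumption (the text states this only for the Poisson rate), and any rounding of the truncated-normal draws to integer demand is left out.

```python
import numpy as np

def sample_poisson_episode(T, rng):
    """Curriculum 1: i.i.d. Poisson demand with a rate drawn once per episode."""
    lam = rng.uniform(5, 20)
    return rng.poisson(lam, size=T)

def sample_truncnormal_episode(T, rng, low=0.0, high=50.0):
    """Curriculum 2: i.i.d. normal demand truncated to [low, high].

    Assumes mu and sigma are drawn once per episode (not stated in the text).
    """
    mu = rng.uniform(8, 20)
    sigma = rng.uniform(2, 6)
    out = np.empty(T)
    filled = 0
    while filled < T:
        draw = rng.normal(mu, sigma, size=T)
        keep = draw[(draw >= low) & (draw <= high)]   # reject out-of-range draws
        n = min(T - filled, keep.size)
        out[filled:filled + n] = keep[:n]
        filled += n
    return out

rng = np.random.default_rng(0)
poisson_demand = sample_poisson_episode(36, rng)
truncnormal_demand = sample_truncnormal_episode(36, rng)
print(poisson_demand[:5], np.round(truncnormal_demand[:5], 2))
```

A mixed curriculum could pick one of the two samplers at random for each episode or switch between them on a schedule across training steps.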
##### Cost structure and reward signals.
Let $c_{k,t}$ denote the cost incurred by agent $k\in\{\text{retailer},\text{wholesaler},\text{distributor},\text{factory}\}$ at week $t$. This cost consists of holding and backorder components:
$$c_{k,t}=c_{\mathrm{hold}}I_{k,t}+c_{\mathrm{back}}B_{k,t}.$$
Here, $I_{k,t}$ and $B_{k,t}$ denote on-hand inventory and backlog, respectively. Aggregating across agents, the total system cost at week $t$ is $c_{t}^{\mathrm{sys}}=\sum_{k}c_{k,t}$, and the cumulative system cost over an episode across $T$ weeks is
$$C^{\mathrm{sys}}=\sum_{t=1}^{T}c_{t}^{\mathrm{sys}}.$$
Similarly, each agent incurs a cumulative cost:
$$C_{k}=\sum_{t=1}^{T}c_{k,t}.$$
The reward signal is defined along two dimensions: reward scope and reward attribution. The reward scope determines whether performance is measured at the system level or at the level of individual agents. Under a system-level objective, the reward is given by $r=-C^{\mathrm{sys}}$, so that all agents share a common signal reflecting total supply chain efficiency. Under an agent-level objective, each agent $k$ is evaluated using its own cumulative cost, $r_{k}=-C_{k}$, which emphasizes decentralized performance.
The reward attribution determines how costs are assigned to individual decisions over time. Under episode-level attribution, a single scalar reward is assigned uniformly to all actions taken within a trajectory, so that each decision receives the same signal $r$ (or $r_{k}$). In contrast, under rollout (return-to-go) attribution, each decision taken at week $t$ is assigned the cumulative downstream cost incurred from that point onward:
$$r_{t}=-\sum_{\tau=t}^{T}c_{\tau},$$
where $c_{\tau}$ corresponds to either $c_{\tau}^{\mathrm{sys}}$ or $c_{k,\tau}$ depending on the chosen reward scope. This formulation attributes credit to actions based on their long-term impact on future costs, allowing earlier decisions to be evaluated according to the downstream consequences they induce.
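The four combinations of reward scope and attribution can be computed from a matrix of weekly costs. The sketch below is illustrative: the function names are hypothetical, and the default cost coefficients (0.5 per unit held, 1.0 per unit backlogged) are the conventional Beer Game values, assumed here because this section does not state them.

```python
import numpy as np

def weekly_costs(inventory, backlog, c_hold=0.5, c_back=1.0):
    """c_{k,t} for arrays of shape (K agents, T weeks); coefficients are assumed."""
    return c_hold * np.asarray(inventory, float) + c_back * np.asarray(backlog, float)

def assign_rewards(costs, scope="system", attribution="episode"):
    """Return rewards r_{k,t} of shape (K, T) for the chosen scope and attribution."""
    costs = np.asarray(costs, float)
    K, T = costs.shape
    if scope == "system":
        # Every agent sees the system cost c_t^sys = sum_k c_{k,t}.
        per_week = np.tile(costs.sum(axis=0), (K, 1))
    else:
        per_week = costs
    if attribution == "episode":
        # One scalar per trajectory, repeated for every week.
        total = per_week.sum(axis=1, keepdims=True)
        return -np.repeat(total, T, axis=1)
    # Return-to-go: reverse cumulative sum over weeks.
    return -np.cumsum(per_week[:, ::-1], axis=1)[:, ::-1]

costs = np.array([[1.0, 2.0, 3.0],
                  [0.0, 1.0, 0.0]])
print(assign_rewards(costs, "system", "rollout"))
print(assign_rewards(costs, "agent", "episode"))
```

Under return-to-go attribution the week-1 reward coincides with the episode-level reward, and later weeks are charged only for the costs that follow them.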
##### Group relative policy optimization \(GRPO\)\.
Group Relative Policy Optimization \(GRPO\) is well\-suited to the supply\-chain setting because it avoids the need to learn a separate value function or critic over a high\-dimensional, partially observed, and multi\-agent state space\. Instead, it uses the group of sampled trajectories as an implicit baseline\. At each training step, multiple episodes are generated under the same environment, and their realized costs are compared\. Trajectories that achieve lower costs than their peers are reinforced, while those with higher costs are discouraged\. This relative evaluation provides a stable learning signal without requiring explicit value estimation\.
##### Advantage construction\.
This comparison is formalized through a group-normalized advantage, which we denote by $\mathrm{Adv}_{k,t}^{(i)}$, where $i$ indexes the episode, $t$ indexes the week, and $k\in\mathcal{A}$ indexes the agent. Let $r_{k,t}^{(i)}$ denote the reward assigned to agent $k$ at week $t$ in episode $i$. The advantage is computed by normalizing this reward across the $G$ episodes collected in the same training step:
$$\mathrm{Adv}_{k,t}^{(i)}=\frac{r_{k,t}^{(i)}-\frac{1}{G}\sum_{j=1}^{G}r_{k,t}^{(j)}}{\sqrt{\frac{1}{G}\sum_{j=1}^{G}\left(r_{k,t}^{(j)}-\frac{1}{G}\sum_{h=1}^{G}r_{k,t}^{(h)}\right)^{2}}+\varepsilon_{\mathrm{norm}}}.$$
The small constant $\varepsilon_{\mathrm{norm}}>0$ prevents division by zero and is distinct from the decision-shock notation in Section [5](https://arxiv.org/html/2605.17036#S5). This normalization ensures that learning depends on relative performance within the group rather than the absolute scale of realized costs. For reward-to-go signals, normalization is performed separately for each week and agent, so that a decision is compared only with decisions made by the same agent at the same temporal position across episodes. Episode-level and system-level rewards are recovered as special cases: under episode-level attribution, $r_{k,t}^{(i)}$ is constant across $t$, while under system-level scope, the same system reward is shared across agents.
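The group normalization above amounts to a z-score across the episode axis. A minimal sketch follows, assuming rewards are stored in an array of shape `(G, K, T)`; that layout, the function name, and the default value of `eps_norm` are illustrative choices rather than details from the paper.

```python
import numpy as np

def group_normalized_advantage(r, eps_norm=1e-8):
    """Normalize rewards across the group (episode) axis.

    r has shape (G, K, T): episodes, agents, weeks. Each (agent, week) cell is
    compared only with the same cell in the other episodes of the group.
    """
    r = np.asarray(r, float)
    mean = r.mean(axis=0, keepdims=True)
    std = r.std(axis=0, keepdims=True)     # population std (1/G), as in the formula
    return (r - mean) / (std + eps_norm)

# Two episodes, one agent, one week: the cheaper episode gets a positive advantage.
adv = group_normalized_advantage([[[-10.0]], [[-20.0]]])
print(adv.ravel())
```

When all episodes in a group realize the same reward, the numerator is zero and $\varepsilon_{\mathrm{norm}}$ keeps the ratio finite, so that group contributes no learning signal.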
##### Policy representation\.
The LLM defines a stochastic policy $\pi_{\theta}$ that maps supply-chain contexts to ordering decisions. Each episode $i$ produces a sequence of weekly decisions $y^{(i)}=(y_{1}^{(i)},\ldots,y_{T}^{(i)})$, where each $y_{t}^{(i)}$ is a vector of actions across the four agents, $y_{t}^{(i)}=\big(y_{r,t}^{(i)},\,y_{w,t}^{(i)},\,y_{d,t}^{(i)},\,y_{f,t}^{(i)}\big)$. Each component corresponds to the order quantity placed at week $t$ by the retailer, wholesaler, distributor, and factory, respectively. The four agents make separate decisions based on their local supply-chain contexts. Conditional on these contexts, the joint likelihood factorizes across weeks and agents:
$$\log\pi_{\theta}(y^{(i)}\mid x^{(i)})=\sum_{t=1}^{T}\sum_{k\in\mathcal{A}}\log\pi_{\theta}\big(y_{k,t}^{(i)}\mid x_{k,t}^{(i)}\big),$$
where $\mathcal{A}=\{\text{retailer},\text{wholesaler},\text{distributor},\text{factory}\}$, $x_{k,t}^{(i)}$ denotes the context available to agent $k$ at week $t$ in episode $i$, and $x^{(i)}$ collects these contexts. Each term is computed from the token-level log-probabilities of the generated response for that agent.
##### Objective function\.
GRPO updates the shared policy by reinforcing agent\-week decisions that perform well relative to comparable decisions in the same training group\. The corresponding objective is
$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T|\mathcal{A}|}\sum_{t=1}^{T}\sum_{k\in\mathcal{A}}\mathrm{Adv}_{k,t}^{(i)}\log\pi_{\theta}\big(y_{k,t}^{(i)}\mid x_{k,t}^{(i)}\big)-\beta D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}})\right]. \tag{25}$$
Thus, positive advantages increase the likelihood of the corresponding agent decisions, while negative advantages decrease it\. Since all agents share the same LLM backbone, gradients from all agents and all weeks are aggregated into a single parameter update\. The KL penalty stabilizes training by constraining the updated policy to remain close to the frozen reference model\.
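The effect of the advantage-weighted log-likelihood term and the KL penalty can be illustrated on a toy policy. The sketch below replaces the LLM with a softmax distribution over three discrete actions, which is purely an illustrative stand-in; the function names, the value of `beta`, and the step size are assumptions. Advantages are treated as constants when differentiating, as in the objective above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def objective_and_grad(logits, ref_logits, actions, adv, beta):
    """Toy analogue of the GRPO objective for a single softmax policy.

    Returns mean(adv * log pi(action)) - beta * KL(pi || pi_ref)
    and its gradient with respect to the logits.
    """
    pi, ref = softmax(logits), softmax(ref_logits)
    log_ratio = np.log(pi) - np.log(ref)
    kl = float(np.sum(pi * log_ratio))
    obj = float(np.mean(adv * np.log(pi[actions])) - beta * kl)
    onehot = np.eye(len(logits))[actions]
    grad_pg = np.mean(adv[:, None] * (onehot - pi), axis=0)   # policy-gradient term
    grad_kl = pi * (log_ratio - kl)                           # d KL / d logits
    return obj, grad_pg - beta * grad_kl

logits = np.zeros(3)
actions, adv = np.array([0, 1]), np.array([1.0, -1.0])
_, grad = objective_and_grad(logits, np.zeros(3), actions, adv, beta=0.1)
new_pi = softmax(logits + 0.5 * grad)       # one gradient-ascent step
print(np.round(new_pi, 3))
```

After one ascent step the action with positive advantage gains probability and the action with negative advantage loses it, while the KL term pulls the policy back toward the reference once the two start to differ.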
##### Model update\.
Each training step consists of two phases: data collection and policy optimization. First, $G$ beer game episodes are simulated using the current policy with stochastic sampling, producing trajectories of decisions and associated cost signals. Based on the chosen reward scope and attribution scheme, advantages are computed either at the episode level or as reward-to-go signals and normalized across the group.
Second, the model parameters are updated via stochastic gradient ascent on the GRPO objective. In practice, for each trajectory $i$, we compute the mean token-level log-probability and weight it by the corresponding advantage:
$$\nabla_{\theta}\mathcal{J}\approx\frac{1}{GT|\mathcal{A}|}\sum_{i=1}^{G}\sum_{t=1}^{T}\sum_{k\in\mathcal{A}}\mathrm{Adv}_{k,t}^{(i)}\nabla_{\theta}\log\pi_{\theta}\big(y_{k,t}^{(i)}\mid x_{k,t}^{(i)}\big)-\beta\nabla_{\theta}D_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big). \tag{26}$$
Gradients are accumulated across all trajectories, optionally regularized with a KL penalty against a frozen reference model, clipped to control variance, and applied using an optimizer such as AdamW. This procedure iteratively shifts the policy toward generating decisions that achieve lower supply-chain costs relative to competing episodes.
##### Evaluation protocol\.
After training, the learned policy is evaluated on the standard MIT Beer Game demand pattern:
$$D_{t}=(4,4,4,4,8,\ldots,8).$$
Performance is measured over 30 independent simulation runs, and we report the average total supply chain cost as well as per-agent costs.
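The evaluation demand pattern and the run-level summary statistics can be sketched as follows. The helper names are hypothetical, the horizon passed to `beer_game_demand` is left to the caller because it is not fixed in this passage, and the use of the sample standard deviation in the coefficient of variation is an assumption.

```python
import numpy as np

def beer_game_demand(T, step_week=5, low=4, high=8):
    """Demand equal to `low` before `step_week` (1-indexed) and `high` afterwards."""
    d = np.full(T, high)
    d[: step_week - 1] = low
    return d

def summarize_costs(total_costs):
    """Mean, coefficient of variation, and maximum of total cost across runs."""
    c = np.asarray(total_costs, float)
    mean = c.mean()
    return {"mean": mean, "cv": c.std(ddof=1) / mean, "max": c.max()}

print(beer_game_demand(8))
# Placeholder costs for illustration only; these are not results from the paper.
print(summarize_costs([900.0, 1000.0, 1100.0]))
```

Reporting the maximum alongside the mean and the coefficient of variation keeps tail events visible, which is the quantity the reliability analysis focuses on.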
## 7 GRPO Post-training Results
Figure 4: Impact of post-training on order reliability. Post-training significantly compresses decision variance across all facilities and mitigates outlier events. Note: The y-axis scale is held constant with Figures 2 and 3 to facilitate direct comparison.
Post-training substantially improves the reliability of LLM agents in inventory management. Across 30 identical runs of the MIT Beer Game under the original demand pattern, the default setting in Figure [2](https://arxiv.org/html/2605.17036#S4.F2) exhibits wide interquartile ranges, long whiskers, and numerous extreme tail events in the distribution of orders. By contrast, after post-training the model through reinforcement learning on the training curriculum using the GRPO algorithm described above, Figure [4](https://arxiv.org/html/2605.17036#S7.F4) shows that the colored boxes representing the upper and lower percentiles largely disappear. This indicates that variation across repeated runs becomes minimal. The trained model therefore produces much more stable ordering decisions, both over time and across all four echelons of the supply chain.
Equally important, post\-training sharply reduces tail risk\. In the post\-training evaluation, the maximum order observed across all facilities and weeks remains below 100, despite the absence of either an explicit budget constraint or a centralized orchestration layer\. This result is especially notable because the earlier analysis showed that out\-of\-the\-box agents often required external guardrails and carefully curated information to prevent panic\-induced over\-ordering\. By contrast, post\-training appears to enable the model to internalize a more disciplined and reliable decision policy, thereby reducing the need for external controls\.
Figure[5](https://arxiv.org/html/2605.17036#S7.F5)extends the analysis from decision behavior to system\-level performance by examining total supply chain costs across repeated runs\. Consistent with the stabilization in ordering decisions, post\-training yields a substantial improvement in both efficiency and robustness\. For Qwen\-3 4B, average total supply chain cost declines from 1,585 without training to 952 after post\-training, while the coefficient of variation falls from 26% to 13%\. Tail risk also contracts sharply, with the maximum realized cost across 30 identical runs decreasing from 2,847 to 1,353\. By comparison, other out\-of\-the\-box models exhibit both higher average costs and greater variability, including GPT\-5 mini at 3,927 with a 45% coefficient of variation and a maximum of 8,644, and Llama 4 Maverick 17B at 4,026 with a 52% coefficient of variation and a maximum of 8,912\. These results indicate that post\-training not only lowers mean cost, but also substantially narrows the distribution of outcomes across identical runs and reduces exposure to high\-cost realizations\.
Figure 5: Post-training improves agent reliability across multiple dimensions: it reduces total supply chain costs, lowers variability across repeated runs, and mitigates worst-case outcomes, thereby improving robustness to tail risks.
These findings shed light on the source of unreliability in autonomous supply chains. The instability we document does not appear to be an inherent consequence of the stochasticity of language models. Rather, it arises from deploying general-purpose models that have not received specialized training for inventory management. Once the agent is exposed to a curriculum of supply chain tasks with synthetic demand and optimized using realized cost feedback, much of the apparent randomness in its supply chain behavior disappears. The trained agent is both more reliable and more efficient: it makes more consistent decisions across runs, exhibits fewer extreme overreactions, and achieves lower average system-wide costs. This suggests that unreliability is, to a considerable extent, a consequence of insufficient domain specialization rather than irreducible stochasticity, and that specialized post-training can improve both the stability and the economic performance of autonomous supply chain agents.
## 8 Conclusion: New Paradigm for Supply Chain Management
GenAI has brought supply chain management to an inflection point: a fully autonomous supply chain is moving rapidly from theory to practice\. Early AI models lacked the reasoning required for complex, strategic supply\-chain decisions; modern models have closed that gap\.
We show that GenAI agents can match and often exceed human decision\-making in planning and replenishment — state\-of\-the\-art models outperform cohorts of students operating the same system\. With the right combination of model selection, prompts, guardrails, and orchestrated information sharing, autonomous agents already achieve strong average performance\.
However, out\-of\-the\-box agents remain unreliable in critical ways\. They can exhibit high order volatility—varying across facilities for the same time period and across time for the same facility—and this is not merely the result of model sampling\. Addressing it requires post\-training on synthetically generated, task\-specific data so agents learn disciplined inventory\-management strategies\. Our post\-training results show substantial gains in reliability, reducing tail risk without sacrificing average performance\.
This implies that the next frontier is not only better deployment of general-purpose LLMs but also the development of specialized AI operations agents that internalize the structure of dynamic inventory control. For firms evaluating AI for planning and replenishment, the difference is material: a system that performs well on average but has unstable tail behavior may be operationally unacceptable, whereas a trained, low-volatility agent is much closer to production readiness.
As AI adoption advances, supply\-chain leaders will move from hands\-on management to strategic orchestration of GenAI agents\. Success depends on four levers: selecting the right model, targeted training and guardrails, curated information orchestration, and precise instruction design\. Mastering these is the new playbook for autonomous supply chains\.
## Acknowledgements
The paper expands and provides more technical details on the concepts and framework described in a recent article: Long, C., Simchi-Levi, D., Calmon, A. P., & Calmon, F. P. When supply chains become autonomous. Harvard Business Review (Long et al., [2025a](https://arxiv.org/html/2605.17036#bib.bib37)).
This material is based upon work supported by the National Science Foundation under Grant Nos. FAI 2040880, CIF 2231707, and CIF 2312667. F. P. Calmon and C. Long would also like to acknowledge support from Coefficient Giving and JPMorgan Chase. F. P. Calmon is also affiliated with Google Research as a Visiting Faculty Researcher.
The authors would like to acknowledge support from Harvard Information Theory Lab, MIT Data Science Lab, Microsoft Accelerating AI Academic Research \(AAAR\) program, the Kempner Institute at Harvard University, and the Ray C\. Anderson Center for Sustainable Business at Georgia Tech\.
## References
- J. Achiam, D. Held, A. Tamar, and P. Abbeel (2017) Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 22–31. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- D. Bertsimas and A. Thiele (2006) A robust optimization approach to inventory theory. Operations Research 54(1), pp. 150–168. External Links: [Document](https://dx.doi.org/10.1287/opre.1050.0238). Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- L. Boussioux, A. Chen, M. Fan, and A. Jain (2025) Socratic iterative reasoning: enhancing LLM decision-making in the beer game supply chain. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p4.1).
- B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini (2024) Large language monkeys: scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787. Cited by: [§2.3](https://arxiv.org/html/2605.17036#S2.SS3.p1.1).
- F. Chen, Z. Drezner, J. K. Ryan, and D. Simchi-Levi (2000a) Quantifying the bullwhip effect in a simple supply chain: the impact of forecasting, lead times, and information. Management Science 46(3), pp. 436–443. Cited by: [§2.1](https://arxiv.org/html/2605.17036#S2.SS1.p3.1), [§5.3.1](https://arxiv.org/html/2605.17036#S5.SS3.SSS1.p2.3).
- F. Chen, J. K. Ryan, and D. Simchi-Levi (2000b) The impact of exponential smoothing forecasts on the bullwhip effect. Naval Research Logistics 47(4), pp. 269–286. Cited by: [§2.1](https://arxiv.org/html/2605.17036#S2.SS1.p3.1), [§5.1](https://arxiv.org/html/2605.17036#S5.SS1.SSS0.Px1.p1.1), [§5.1](https://arxiv.org/html/2605.17036#S5.SS1.SSS0.Px1.p2.4), [§5.3.1](https://arxiv.org/html/2605.17036#S5.SS3.SSS1.p2.3).
- Y. Chow, A. Tamar, S. Mannor, and M. Pavone (2015) Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, Vol. 28. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- J. W. Forrester (1961) Industrial dynamics. MIT Press, Cambridge, MA. Cited by: [§2.1](https://arxiv.org/html/2605.17036#S2.SS1.p1.1).
- M. S. Fox, M. Barbuceanu, and R. Teigen (2000) Agent-oriented supply-chain management. International Journal of Flexible Manufacturing Systems 12(2), pp. 165–188. External Links: [Document](https://dx.doi.org/10.1023/A%3A1008195614074). Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p1.1).
- G. Gallego and I. Moon (1993) The distribution free newsboy problem: review and extensions. Journal of the Operational Research Society 44(8), pp. 825–834. External Links: [Document](https://dx.doi.org/10.1057/jors.1993.141). Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- J. García and F. Fernández (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(42), pp. 1437–1480. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. (2025) DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, pp. 633–638. External Links: [Document](https://dx.doi.org/10.1038/s41586-025-09422-z). Cited by: [§2.5](https://arxiv.org/html/2605.17036#S2.SS5.p1.1).
- V. Jannelli, S. Schoepf, M. Bickel, T. Netland, and A. Brintrup (2026) Agentic LLMs in the supply chain: towards autonomous multi-agent consensus-seeking. International Journal of Production Research, pp. 1–31. Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p2.1).
- N. Julka, R. Srinivasan, and I. A. Karimi (2002) Agent-based supply chain management–1: framework. Computers & Chemical Engineering 26(12), pp. 1755–1769. External Links: [Document](https://dx.doi.org/10.1016/S0098-1354%2802%2900150-3). Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p1.1).
- B. Kim, J. G. Kim, and S. Lee (2024) A multi-agent reinforcement learning model for inventory transshipments under supply chain disruption. IISE Transactions 56(7), pp. 715–728. External Links: [Document](https://dx.doi.org/10.1080/24725854.2023.2217248). Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p2.1).
- N. Kotecha and A. del Rio Chanona (2025) Leveraging graph neural networks and multi-agent reinforcement learning for inventory control in supply chains. Computers & Chemical Engineering 199, pp. 109111. Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p2.1).
- H. L. Lee, V. Padmanabhan, and S. Whang (1997a) Information distortion in a supply chain: the bullwhip effect. Management Science 43(4), pp. 546–558. External Links: [Document](https://dx.doi.org/10.1287/mnsc.43.4.546). Cited by: [§2.1](https://arxiv.org/html/2605.17036#S2.SS1.p3.1).
- H. L. Lee, V. Padmanabhan, and S. Whang (1997b) The bullwhip effect in supply chains. Sloan Management Review 38(3), pp. 93–102. Cited by: [§2.1](https://arxiv.org/html/2605.17036#S2.SS1.p3.1).
- C. Long, D. Simchi-Levi, A. P. Calmon, and F. P. Calmon (2025a) When supply chains become autonomous. Harvard Business Review. Online article. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p3.1), [Acknowledgements](https://arxiv.org/html/2605.17036#Sx1.p1.1), [footnote 1](https://arxiv.org/html/2605.17036#footnote1).
- C. Long, D. Simchi-Levi, A. P. Calmon, and F. P. Calmon (2025b) The GenAI beer game. Note: [Online]. Available: [https://infotheorylab.github.io/beer-game/](https://infotheorylab.github.io/beer-game/). Accessed: December 15, 2025. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p3.1).
- I. Menache, J. Pathuri, D. Simchi-Levi, and T. Linton (2025) How generative AI improves supply chain management. Harvard Business Review 104(1-2), pp. 86–95. Cited by: [§1](https://arxiv.org/html/2605.17036#S1.p1.1), [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p1.1).
- M. E. Nissen (2001) Agent-based supply chain integration. Information Technology and Management 2(3), pp. 289–312. Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p1.1).
- Y. Quan and Z. Liu (2024) InvAgent: a large language model based multi-agent system for inventory management in supply chains. arXiv preprint arXiv:2407.11384. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p1.1), [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p2.1).
- R. T. Rockafellar and S. Uryasev (2000) Optimization of conditional value-at-risk. Journal of Risk 2(3), pp. 21–41. External Links: [Document](https://dx.doi.org/10.21314/JOR.2000.038). Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- H. E. Scarf (1958) A min-max solution of an inventory problem. In Studies in the Mathematical Theory of Inventory and Production, K. J. Arrow, S. Karlin, and H. E. Scarf (Eds.), pp. 201–209. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p2.1).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: [§2.5](https://arxiv.org/html/2605.17036#S2.SS5.p1.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024) DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: [§2.5](https://arxiv.org/html/2605.17036#S2.SS5.p1.1).
- E. A. Silver, D. F. Pyke, and R. Peterson (1998) Inventory management and production planning and scheduling. 3rd edition, John Wiley & Sons, New York. Cited by: [§5.1](https://arxiv.org/html/2605.17036#S5.SS1.SSS0.Px1.p2.4).
- D. Simchi-Levi, K. Mellou, I. Menache, and J. Pathuri (2025a) Large language models for supply chain decisions. arXiv preprint arXiv:2507.21502. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p1.1).
- D. Simchi-Levi, Z. Zheng, and F. Zhu (2023) Regret distribution in stochastic bandits: optimal trade-off between expectation and tail risk. arXiv preprint arXiv:2304.04341. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p1.1).
- D. Simchi-Levi, Z. Zheng, and F. Zhu (2025b) A simple and optimal policy design with safety against heavy-tailed risk for stochastic bandits. Management Science 71(7), pp. 6298–6318. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p1.1).
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. Cited by: [§2.3](https://arxiv.org/html/2605.17036#S2.SS3.p1.1).
- J. D. Sterman (1989) Modeling managerial behavior: misperceptions of feedback in a dynamic decision making experiment. Management Science 35(3), pp. 321–339. External Links: [Document](https://dx.doi.org/10.1287/mnsc.35.3.321). Cited by: [§2.1](https://arxiv.org/html/2605.17036#S2.SS1.p1.1).
- J. M. Swaminathan, S. F. Smith, and N. M. Sadeh (1998) Modeling supply chain dynamics: a multiagent approach. Decision Sciences 29(3), pp. 607–632. External Links: [Document](https://dx.doi.org/10.1111/j.1540-5915.1998.tb01356.x). Cited by: [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p1.1).
- X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2022) Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171. Cited by: [§2.3](https://arxiv.org/html/2605.17036#S2.SS3.p1.1), [§4.3](https://arxiv.org/html/2605.17036#S4.SS3.p1.1).
- Y. Wu, Z. Sun, S. Li, S. Welleck, and Y. Yang (2024) Inference scaling laws: an empirical analysis of compute-optimal inference for problem-solving with language models. arXiv preprint arXiv:2408.00724. Cited by: [§2.3](https://arxiv.org/html/2605.17036#S2.SS3.p1.1).
- L. Xu, S. Almahri, S. Mak, and A. Brintrup (2024a) Multi-agent systems and foundation models enable autonomous supply chains: opportunities and challenges. IFAC-PapersOnLine 58(19), pp. 795–800. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p1.1).
- L. Xu, S. Mak, M. Minaricova, and A. Brintrup (2024b) On implementing autonomous supply chains: a multi-agent system approach. Computers in Industry 161, pp. 104120. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p1.1), [§2.4](https://arxiv.org/html/2605.17036#S2.SS4.p2.1).
- G. Zheng, S. Almahri, L. Xu, M. Minaricova, and A. Brintrup (2025) LLMs in supply chain management: opportunities and a case study. IFAC-PapersOnLine 59(10), pp. 2951–2956. Cited by: [§2.2](https://arxiv.org/html/2605.17036#S2.SS2.p1.1).
- F. Zhu and D. Simchi-Levi (2026) Adaptive variance inflation in Thompson sampling: efficiency, safety, robustness, and beyond. Advances in Neural Information Processing Systems 38, pp. 50466–50484. Cited by: [§2.6](https://arxiv.org/html/2605.17036#S2.SS6.p1.1).
## Appendix A Additional Information for Section [3](https://arxiv.org/html/2605.17036#S3)
### A.1 Detailed Experimental Results
The following tables present the numerical data underlying the findings discussed in the main text. Table [2](https://arxiv.org/html/2605.17036#A1.T2) reports the total supply chain costs recorded across eleven runs of the Beer Game played by human student teams (4–8 students per team, more than 100 students in total) from two Georgia Tech cohorts (April 2024 and April 2025), together with the average cost of $3,207 that serves as the human performance benchmark throughout the study. Table [3](https://arxiv.org/html/2605.17036#A1.T3) summarizes the aggregate performance of all generative AI configurations tested under the classical Beer Game setting played by the students (20 weeks, 2-2-2 lead times), listing total costs and normalized costs relative to the human benchmark for each combination of model type and inference-time technique; values below 100 indicate that the agent outperformed the average human team. Table [4](https://arxiv.org/html/2605.17036#A1.T4) identifies the specific model comparisons underlying the numerical results reported in the main text, together with the corresponding percentage changes in total supply chain costs and differences in the coefficient of variation.

Table 2: Human Teams: Total Costs per Run

Table 3: Supply Chain Performance: 20 Weeks Beer Game

Table 4: Supply Chain Analysis Summary
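To make the normalized-cost convention concrete, here is a minimal sketch. It assumes the normalization is simply 100 × (agent cost) / (human benchmark), which is consistent with the statement that values below 100 beat the average human team, and it assumes the coefficient of variation uses the sample standard deviation. The function names and the run costs are hypothetical illustrations, not values from the tables.

```python
import statistics

# Average human-team cost reported in Appendix A.1 (the human benchmark).
HUMAN_BENCHMARK_COST = 3207.0


def normalized_cost(total_cost, benchmark=HUMAN_BENCHMARK_COST):
    """Cost as a percentage of the human benchmark (assumed convention: 100 = parity)."""
    return 100.0 * total_cost / benchmark


def coefficient_of_variation(costs):
    """Sample standard deviation divided by the mean (assumed definition)."""
    return statistics.stdev(costs) / statistics.mean(costs)


# Made-up run costs for one hypothetical agent configuration.
example_costs = [1000.0, 1200.0, 1100.0]
print(normalized_cost(statistics.mean(example_costs)))
print(coefficient_of_variation(example_costs))
```

Under this convention, a 67% cost reduction relative to human teams would correspond to a normalized cost of about 33.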
### A.2 LLM Prompts
Example Prompt for Retailer

You are the Retailer in the Beer Distribution Game. Your objective is to minimize your total supply chain costs by managing your beer inventory efficiently. You receive orders from customers and stock up your inventory from the Wholesaler. Your only task is to decide, based on your inventory status and incoming order (shown below), how many new cases of beer you want to buy this week.

Here are the costs you face:
- Holding Cost: 0.50 per case per week.
- Backorder Cost: 1.00 per case per week.
- Order Lead Time: 1 week (your order reaches the Wholesaler next week).
- Shipping Lead Time: 2 weeks (your delivery from the Wholesaler arrives 2 weeks after they ship).

\*\*Your Current Situation (Week week):\*\*
- Current Inventory: current\_inventory cases
- Current Backlog: current\_backlog cases
- Incoming Order from Downstream (Customer Demand): incoming\_order\_this\_week cases
- Last Order You Placed: last\_order\_placed cases
- Last Delivery You Received: last\_delivery\_received cases

pipeline\_info budget\_info fixed\_cost\_info order\_forecast\_info feedback\_info

—————————

Your Task: Decide how many cases of beer to order from your upstream this week based on your current situation.

Start your response with a JSON object \*\*on its own line\*\* in the following exact format: "order\_quantity": <number\_of\_cases\>

Important:
- Replace ‘<number\_of\_cases\>‘ with your actual numeric decision.
- Do not add any text, notes, or punctuation after the JSON.
- This will be parsed by a program, so the format must be valid and exact.

Example (your response should end like this): "order\_quantity": 5
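The prompt states that the reply "will be parsed by a program". The sketch below shows one way such a parser could look. It is not the authors' implementation: the function name `parse_order_quantity` is hypothetical, and it assumes the reply contains a brace-delimited JSON object such as `{"order_quantity": 5}` (the braces do not appear in the prompt text as extracted here).

```python
import json
import math
import re


def parse_order_quantity(response: str) -> int:
    """Return the order quantity from the first flat JSON object in an LLM reply.

    Assumption: the reply contains an object like {"order_quantity": 5}.
    Raises ValueError when no valid non-negative whole number is found.
    """
    match = re.search(r"\{[^{}]*\}", response)
    if match is None:
        raise ValueError("no JSON object found in response")
    payload = json.loads(match.group(0))  # JSONDecodeError is a ValueError
    quantity = payload.get("order_quantity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, (int, float)):
        raise ValueError("order_quantity must be numeric")
    if not math.isfinite(quantity) or quantity < 0 or quantity != int(quantity):
        raise ValueError("order_quantity must be a non-negative whole number")
    return int(quantity)


print(parse_order_quantity('{"order_quantity": 5}\nRationale: demand is stable.'))
```

Rejecting malformed or negative quantities at this boundary is one simple form of the guardrails discussed in the paper; what to do on failure (retry, fall back to the last order, and so on) is a separate design choice not specified here.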
## Appendix B Proofs for Section [5](https://arxiv.org/html/2605.17036#S5)
###### Proof of Proposition [1](https://arxiv.org/html/2605.17036#Thmproposition1).
By definition,

$$IP_{k,t} = OH_{k,t} + O_{k,t} - B_{k,t}. \tag{27}$$

Therefore,

$$IP_{k,t+1} = OH_{k,t+1} + O_{k,t+1} - B_{k,t+1}. \tag{28}$$

Using the operational state equations,

$$OH_{k,t+1} = OH_{k,t} + r_{k,t} - s_{k,t}, \tag{29}$$

$$O_{k,t+1} = O_{k,t} + q_{k,t} - r_{k,t}, \tag{30}$$

$$B_{k,t+1} = B_{k,t} + q_{k-1,t} - s_{k,t}. \tag{31}$$

Substituting these three equations into the definition of $IP_{k,t+1}$ gives

$$IP_{k,t+1} = \left(OH_{k,t} + r_{k,t} - s_{k,t}\right) + \left(O_{k,t} + q_{k,t} - r_{k,t}\right) - \left(B_{k,t} + q_{k-1,t} - s_{k,t}\right). \tag{32}$$

Expanding terms,

$$IP_{k,t+1} = OH_{k,t} + r_{k,t} - s_{k,t} + O_{k,t} + q_{k,t} - r_{k,t} - B_{k,t} - q_{k-1,t} + s_{k,t}. \tag{33}$$

The receipt terms $r_{k,t}$ cancel, and the shipment terms $s_{k,t}$ also cancel. Hence

$$IP_{k,t+1} = OH_{k,t} + O_{k,t} - B_{k,t} + q_{k,t} - q_{k-1,t}. \tag{34}$$

Using the definition of inventory position,

$$OH_{k,t} + O_{k,t} - B_{k,t} = IP_{k,t}. \tag{35}$$

Therefore,

$$IP_{k,t+1} = IP_{k,t} + q_{k,t} - q_{k-1,t}. \tag{36}$$

This proves the claim. ∎
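The cancellation in Proposition 1 can be checked numerically. The sketch below draws arbitrary integer flows, applies the three state equations (29)–(31), and confirms that the inventory position moves only with the net order flow, as in (36). The function name, initial stocks, and flow ranges are illustrative assumptions; the flows are not required to be operationally feasible because the identity is purely algebraic.

```python
import random


def inventory_position_residuals(periods=50, seed=0):
    """Residuals IP_{t+1} - (IP_t + q_{k,t} - q_{k-1,t}) under equations (29)-(31)."""
    rng = random.Random(seed)
    on_hand, on_order, backlog = 12, 8, 0  # arbitrary initial state
    residuals = []
    for _ in range(periods):
        ip = on_hand + on_order - backlog          # definition (27)
        q_own = rng.randint(0, 10)                 # order placed upstream, q_{k,t}
        q_down = rng.randint(0, 10)                # order received from downstream, q_{k-1,t}
        receipts = rng.randint(0, 10)              # r_{k,t}
        shipments = rng.randint(0, 10)             # s_{k,t}
        on_hand = on_hand + receipts - shipments   # (29)
        on_order = on_order + q_own - receipts     # (30)
        backlog = backlog + q_down - shipments     # (31)
        ip_next = on_hand + on_order - backlog
        residuals.append(ip_next - (ip + q_own - q_down))
    return residuals


print(set(inventory_position_residuals()))
```

Because all quantities are integers, every residual is exactly zero regardless of the receipts and shipments drawn.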
###### Proof of Proposition [2](https://arxiv.org/html/2605.17036#Thmproposition2).
Let $a_k := 1-\lambda_k$, $u_t := q_{k-1,t}$, $y_t := q_{k,t}$, $f_t := \hat{q}_{k,t}$, and $\epsilon_t := \epsilon_{k,t}$. The linear order rule gives

$$IP_{k,t} = \theta_k f_t + \epsilon_t - y_t.$$

Using the inventory-position recursion,

$$IP_{k,t+1} = \theta_k f_t + \epsilon_t - u_t.$$

Applying the order rule one period forward yields

$$y_{t+1} = \theta_k (f_{t+1} - f_t) + u_t + \epsilon_{t+1} - \epsilon_t. \tag{37}$$

Since $f_{t+1} = \lambda_k u_t + a_k f_t$,

$$f_{t+1} - f_t = \lambda_k (u_t - f_t).$$

Thus

$$y_{t+1} = (1+\theta_k\lambda_k) u_t - \theta_k\lambda_k f_t + \epsilon_{t+1} - \epsilon_t. \tag{38}$$

It remains to eliminate $f_t$. Applying ([37](https://arxiv.org/html/2605.17036#A2.E37)) one period earlier gives

$$y_t = \theta_k (f_t - f_{t-1}) + u_{t-1} + \epsilon_t - \epsilon_{t-1}.$$

Using $f_t - f_{t-1} = \lambda_k (u_{t-1} - f_{t-1})$, we obtain

$$\theta_k\lambda_k f_{t-1} = \theta_k\lambda_k u_{t-1} + u_{t-1} + \epsilon_t - \epsilon_{t-1} - y_t.$$

The forecast recursion $f_t = \lambda_k u_{t-1} + a_k f_{t-1}$ then implies

$$\theta_k\lambda_k f_t = \theta_k\lambda_k^2 u_{t-1} + a_k\theta_k\lambda_k f_{t-1} \tag{39}$$

$$\phantom{\theta_k\lambda_k f_t} = (\theta_k\lambda_k + a_k) u_{t-1} + a_k(\epsilon_t - \epsilon_{t-1}) - a_k y_t. \tag{40}$$

Substituting ([40](https://arxiv.org/html/2605.17036#A2.E40)) into ([38](https://arxiv.org/html/2605.17036#A2.E38)) gives

$$y_{t+1} = (1+\theta_k\lambda_k) u_t - (\theta_k\lambda_k + a_k) u_{t-1} + a_k y_t + \epsilon_{t+1} - (1+a_k)\epsilon_t + a_k\epsilon_{t-1}.$$

Returning to the original notation and using $a_k = 1-\lambda_k$ proves ([18](https://arxiv.org/html/2605.17036#S5.E18)).

Writing the recurrence at time $t$ in lag-operator form gives

$$\left[1 - a_k\mathcal{L}\right] q_{k,t} = \left[(1+\theta_k\lambda_k)\mathcal{L} - (\theta_k\lambda_k + a_k)\mathcal{L}^2\right] q_{k-1,t} + \left[1 - a_k\mathcal{L}\right](1-\mathcal{L})\,\epsilon_{k,t}.$$

Dividing by $1 - a_k\mathcal{L}$ gives

$$q^{(k)} = H_k(\mathcal{L})\, q^{(k-1)} + (1-\mathcal{L})\,\epsilon^{(k)},$$

where $H_k$ is ([20](https://arxiv.org/html/2605.17036#S5.E20)). Hence $G(\mathcal{L}) = 1-\mathcal{L}$, proving ([19](https://arxiv.org/html/2605.17036#S5.E19)) and ([21](https://arxiv.org/html/2605.17036#S5.E21)). ∎
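As a sanity check on the algebra, the sketch below simulates a single tier under the three ingredients used in the proof (the linear order rule, the inventory-position recursion, and the exponential-smoothing forecast) and compares the simulated orders with the second-order recurrence obtained after substituting (40) into (38). Parameter values, the initial state, and the Gaussian inputs are arbitrary assumptions chosen only for illustration.

```python
import random


def recurrence_residuals(theta=1.5, lam=0.3, periods=40, seed=1):
    """Gap between simulated orders y_{t+1} and the closed recurrence of Proposition 2."""
    rng = random.Random(seed)
    a = 1.0 - lam
    u = [rng.gauss(10.0, 2.0) for _ in range(periods)]    # downstream orders q_{k-1,t}
    eps = [rng.gauss(0.0, 1.0) for _ in range(periods)]   # decision shocks eps_{k,t}
    forecast, ip = 10.0, 20.0                             # arbitrary f_0 and IP_0
    y = []
    for t in range(periods):
        y.append(theta * forecast + eps[t] - ip)  # order rule: IP_t = theta f_t + eps_t - y_t
        ip = ip + y[t] - u[t]                     # inventory-position recursion (36)
        forecast = lam * u[t] + a * forecast      # f_{t+1} = lam u_t + a f_t
    residuals = []
    for t in range(1, periods - 1):
        rhs = ((1 + theta * lam) * u[t] - (theta * lam + a) * u[t - 1] + a * y[t]
               + eps[t + 1] - (1 + a) * eps[t] + a * eps[t - 1])
        residuals.append(y[t + 1] - rhs)
    return residuals


print(max(abs(r) for r in recurrence_residuals()))
```

The residuals vanish up to floating-point rounding for any choice of initial forecast and inventory position, since neither appears in the final recurrence.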
###### Lemma 1 (One-tier gain).
For tier $k$, define

$$g_k(\omega) := |H_k(e^{-i\omega})|^2.$$

Then

$$g_k(\omega) = 1 + \frac{2\theta_k\lambda_k(2-\lambda_k+\theta_k\lambda_k)(1-\cos\omega)}{\lambda_k^2 + 2(1-\lambda_k)(1-\cos\omega)} \geq 1.$$

Moreover, the average gain

$$\Gamma_k := \frac{1}{2\pi}\int_{-\pi}^{\pi} g_k(\omega)\,d\omega$$

is

$$\Gamma_k = 1 + 2\theta_k\lambda_k + \frac{2\theta_k^2\lambda_k^2}{2-\lambda_k}.$$

Consequently,

$$\Gamma_k \geq \Gamma := 1 + 2\theta\lambda + \frac{2\theta^2\lambda^2}{2-\lambda} > 1.$$
###### Proof of Lemma [1](https://arxiv.org/html/2605.17036#Thmlemma1).
Write $a_k = 1-\lambda_k$. Since the leading lag $\mathcal{L}$ has unit modulus on the unit circle, it does not affect the frequency gain. Hence

$$g_k(\omega) = \left|1 + \frac{\theta_k\lambda_k(1-e^{-i\omega})}{1-a_k e^{-i\omega}}\right|^2.$$

Combining terms gives

$$g_k(\omega) = \left|\frac{1+\theta_k\lambda_k - (a_k+\theta_k\lambda_k)e^{-i\omega}}{1-a_k e^{-i\omega}}\right|^2.$$

Using

$$|1-a_k e^{-i\omega}|^2 = \lambda_k^2 + 2(1-\lambda_k)(1-\cos\omega),$$

a direct expansion yields

$$g_k(\omega) = 1 + \frac{2\theta_k\lambda_k(2-\lambda_k+\theta_k\lambda_k)(1-\cos\omega)}{\lambda_k^2 + 2(1-\lambda_k)(1-\cos\omega)}.$$

The numerator and denominator of the second term are nonnegative, so $g_k(\omega) \geq 1$. If $\omega \neq 0$, then $1-\cos\omega > 0$, and since $\theta_k > 0$ and $\lambda_k > 0$, the inequality is strict.

Next, expand $H_k(\mathcal{L})$ as an impulse-response sequence. We have

$$H_k(\mathcal{L}) = \frac{(1+\theta_k\lambda_k)\mathcal{L} - (\theta_k\lambda_k + 1-\lambda_k)\mathcal{L}^2}{1-(1-\lambda_k)\mathcal{L}}.$$

Therefore the impulse coefficients $h_{k,j}$ of $H_k$ satisfy

$$h_{k,0} = 0, \qquad h_{k,1} = 1+\theta_k\lambda_k,$$

and, for $j \geq 2$,

$$h_{k,j} = -\theta_k\lambda_k^2(1-\lambda_k)^{j-2}.$$

By Parseval's identity,

$$\Gamma_k = \frac{1}{2\pi}\int_{-\pi}^{\pi} g_k(\omega)\,d\omega = \sum_{j=0}^{\infty} h_{k,j}^2.$$

Thus

$$\Gamma_k = (1+\theta_k\lambda_k)^2 + \theta_k^2\lambda_k^4\sum_{j=0}^{\infty}(1-\lambda_k)^{2j}.$$

Since

$$\sum_{j=0}^{\infty}(1-\lambda_k)^{2j} = \frac{1}{1-(1-\lambda_k)^2} = \frac{1}{\lambda_k(2-\lambda_k)},$$

we obtain

$$\Gamma_k = (1+\theta_k\lambda_k)^2 + \frac{\theta_k^2\lambda_k^3}{2-\lambda_k}.$$

Equivalently,

$$\Gamma_k = 1 + 2\theta_k\lambda_k + \frac{2\theta_k^2\lambda_k^2}{2-\lambda_k}.$$

The expression is increasing in both $\theta_k$ and $\lambda_k$ on $\theta_k > 0$, $\lambda_k \in (0,1]$. Therefore

$$\Gamma_k \geq 1 + 2\theta\lambda + \frac{2\theta^2\lambda^2}{2-\lambda} = \Gamma.$$

Finally, $\Gamma > 1$ because $\theta > 0$ and $\lambda > 0$. ∎
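The two closed forms in Lemma 1 can be cross-checked numerically. The sketch below evaluates $|H_k(e^{-i\omega})|^2$ directly from the transfer function written in the proof, compares it with the closed-form gain, and approximates the average gain by a midpoint rule. The function names and the parameter values are illustrative assumptions.

```python
import cmath
import math


def gain_direct(theta, lam, omega):
    """|H_k(e^{-i omega})|^2 evaluated from the rational transfer function."""
    a = 1.0 - lam
    z = cmath.exp(-1j * omega)
    h = ((1 + theta * lam) * z - (theta * lam + a) * z * z) / (1 - a * z)
    return abs(h) ** 2


def gain_closed_form(theta, lam, omega):
    """Closed-form one-tier gain g_k(omega) from Lemma 1."""
    u = 1.0 - math.cos(omega)
    return 1.0 + 2 * theta * lam * (2 - lam + theta * lam) * u / (lam ** 2 + 2 * (1 - lam) * u)


def average_gain_closed_form(theta, lam):
    """Closed-form average gain Gamma_k from Lemma 1."""
    return 1.0 + 2 * theta * lam + 2 * theta ** 2 * lam ** 2 / (2 - lam)


def average_gain_numeric(theta, lam, n=20000):
    """Midpoint-rule approximation of (1/2pi) * integral of g_k over [-pi, pi]."""
    step = 2 * math.pi / n
    return sum(gain_direct(theta, lam, -math.pi + (i + 0.5) * step) for i in range(n)) / n


theta, lam = 1.0, 0.5  # illustrative values
print(gain_direct(theta, lam, 1.0), gain_closed_form(theta, lam, 1.0))
print(average_gain_numeric(theta, lam), average_gain_closed_form(theta, lam))
```

For $\theta_k = 1$ and $\lambda_k = 0.5$ the closed form gives $\Gamma_k = 1 + 1 + \tfrac{1}{3} = \tfrac{7}{3}$, so even a single tier more than doubles the variance of independent inputs.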
###### Lemma 2 (Variance lower bound for independent inputs).
Let

$$A(\mathcal{L}) = \sum_{j=0}^{\infty} a_j\mathcal{L}^j$$

be a square-summable linear filter. Let $\{Z_t\}$ be centered and independent across time, with

$$\operatorname{Var}(Z_t) \geq \sigma^2 > 0 \qquad \text{for all } t.$$

Define

$$Y_t = A(\mathcal{L})Z_t = \sum_{j=0}^{\infty} a_j Z_{t-j}.$$

Then

$$\operatorname{Var}(Y_t) \geq \sigma^2\sum_{j=0}^{\infty} a_j^2.$$

Equivalently,

$$\operatorname{Var}(Y_t) \geq \frac{\sigma^2}{2\pi}\int_{-\pi}^{\pi} |A(e^{-i\omega})|^2\,d\omega.$$

###### Proof.
Because the inputs are centered and independent across time,

$$\operatorname{Var}(Y_t) = \operatorname{Var}\left(\sum_{j=0}^{\infty} a_j Z_{t-j}\right) = \sum_{j=0}^{\infty} a_j^2\operatorname{Var}(Z_{t-j}).$$

Using the lower bound $\operatorname{Var}(Z_{t-j}) \geq \sigma^2$, we obtain

$$\operatorname{Var}(Y_t) \geq \sigma^2\sum_{j=0}^{\infty} a_j^2.$$

The frequency-domain expression follows from Parseval's identity:

$$\sum_{j=0}^{\infty} a_j^2 = \frac{1}{2\pi}\int_{-\pi}^{\pi} |A(e^{-i\omega})|^2\,d\omega.$$

∎
###### Lemma 3 (Product-gain lower bound).
Let $\Omega$ be uniformly distributed on $[-\pi,\pi]$. Suppose $f_1,\dots,f_m$ are nonnegative functions of $1-\cos\Omega$ that are nondecreasing in $1-\cos\Omega$. Then

$$\mathbb{E}\left[\prod_{r=1}^{m} f_r(\Omega)\right] \geq \prod_{r=1}^{m}\mathbb{E}[f_r(\Omega)].$$

In particular,

$$\frac{1}{2\pi}\int_{-\pi}^{\pi}\prod_{r=1}^{m} g_r(\omega)\,d\omega \geq \prod_{r=1}^{m}\Gamma_r \geq \Gamma^m.$$

###### Proof.
Let $U = 1-\cos\Omega$. Each $f_r$ is a nonnegative nondecreasing function of the same scalar random variable $U$.

For two nondecreasing functions $f$ and $h$, let $U'$ be an independent copy of $U$. Then

$$\operatorname{Cov}(f(U),h(U)) = \frac{1}{2}\mathbb{E}\left[(f(U)-f(U'))(h(U)-h(U'))\right] \geq 0,$$

because the two factors always have the same sign. Therefore

$$\mathbb{E}[f(U)h(U)] \geq \mathbb{E}[f(U)]\,\mathbb{E}[h(U)].$$

Since a product of nonnegative nondecreasing functions is again nonnegative and nondecreasing, applying this argument repeatedly gives

$$\mathbb{E}\left[\prod_{r=1}^{m} f_r(U)\right] \geq \prod_{r=1}^{m}\mathbb{E}[f_r(U)].$$

By Lemma [1](https://arxiv.org/html/2605.17036#Thmlemma1), each $g_r(\omega)$ is a nonnegative nondecreasing function of $1-\cos\omega$. Hence

$$\frac{1}{2\pi}\int_{-\pi}^{\pi}\prod_{r=1}^{m} g_r(\omega)\,d\omega \geq \prod_{r=1}^{m}\frac{1}{2\pi}\int_{-\pi}^{\pi} g_r(\omega)\,d\omega = \prod_{r=1}^{m}\Gamma_r.$$

Since each $\Gamma_r \geq \Gamma$, the result follows. ∎
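A quick numerical illustration of Lemma 3: the average of a product of one-tier gains is compared with the product of their averages. The sketch assumes the closed-form gain from Lemma 1 and uses a midpoint rule for the frequency average. The helper names and the three (θ, λ) pairs are hypothetical choices.

```python
import math


def gain(theta, lam, omega):
    """Closed-form one-tier gain g_k(omega) from Lemma 1."""
    u = 1.0 - math.cos(omega)
    return 1.0 + 2 * theta * lam * (2 - lam + theta * lam) * u / (lam ** 2 + 2 * (1 - lam) * u)


def circle_mean(fn, n=20000):
    """Midpoint-rule approximation of (1/2pi) * integral of fn over [-pi, pi]."""
    step = 2 * math.pi / n
    return sum(fn(-math.pi + (i + 0.5) * step) for i in range(n)) / n


def product_gain_sides(tiers):
    """Return (mean of the product of gains, product of the mean gains)."""
    lhs = circle_mean(lambda w: math.prod(gain(th, la, w) for th, la in tiers))
    rhs = math.prod(
        circle_mean(lambda w, th=th, la=la: gain(th, la, w)) for th, la in tiers
    )
    return lhs, rhs


tiers = [(1.0, 0.3), (2.0, 0.5), (0.5, 0.8)]  # illustrative (theta, lambda) pairs
lhs, rhs = product_gain_sides(tiers)
print(lhs, rhs, lhs >= rhs)
```

The gap between the two sides reflects the positive correlation among the gains: all tiers amplify the same high-frequency components, so their effects compound rather than average out.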
###### Proof of Theorem [1](https://arxiv.org/html/2605.17036#Thmtheorem1).
Taking conditional expectation over the decision shocks gives the demand channel

$$\bar{\boldsymbol{q}}_k = H_k(\mathcal{L})\,\bar{\boldsymbol{q}}_{k-1}, \qquad \bar{\boldsymbol{q}}_0 = \boldsymbol{D}.$$

Therefore

$$\bar{\boldsymbol{q}}_k = \left(\prod_{r=1}^{k} H_r(\mathcal{L})\right)\boldsymbol{D}.$$

Define the composite demand filter

$$A_k(\mathcal{L}) := \prod_{r=1}^{k} H_r(\mathcal{L}).$$

Then

$$|A_k(e^{-i\omega})|^2 = \prod_{r=1}^{k} g_r(\omega).$$

By Lemma [2](https://arxiv.org/html/2605.17036#Thmlemma2), since $D_t$ is centered, independent across time, and satisfies $\operatorname{Var}(D_t) \geq \sigma_D^2$,

$$V^{D}_k = \operatorname{Var}_D(\bar{q}_{k,t}) \geq \frac{\sigma_D^2}{2\pi}\int_{-\pi}^{\pi}\prod_{r=1}^{k} g_r(\omega)\,d\omega.$$

By Lemma [3](https://arxiv.org/html/2605.17036#Thmlemma3),

$$\frac{1}{2\pi}\int_{-\pi}^{\pi}\prod_{r=1}^{k} g_r(\omega)\,d\omega \geq \prod_{r=1}^{k}\Gamma_r.$$

Finally, Lemma [1](https://arxiv.org/html/2605.17036#Thmlemma1) gives

$$\Gamma_r \geq \Gamma > 1 \qquad \text{for all } r.$$

Hence

$$V^{D}_k \geq \sigma_D^2\prod_{r=1}^{k}\Gamma_r \geq \sigma_D^2\,\Gamma^k.$$

This proves the claim. ∎
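To see how fast the bound in Theorem 1 grows, the sketch below computes the exact variance amplification of independent, unit-variance demand through a chain of tiers, using the impulse-response coefficients from the proof of Lemma 1, and compares it with $\Gamma^k$. It assumes identical tiers with illustrative parameters (θ = 1, λ = 0.5) and truncates each impulse response at a fixed number of terms; the helper names are hypothetical.

```python
def impulse_response(theta, lam, n_terms):
    """First n_terms impulse coefficients h_{k,0}, h_{k,1}, ... of H_k (Lemma 1)."""
    h = [0.0] * n_terms
    if n_terms > 1:
        h[1] = 1.0 + theta * lam
    for j in range(2, n_terms):
        h[j] = -theta * lam ** 2 * (1.0 - lam) ** (j - 2)
    return h


def convolve(a, b, n_terms):
    """Truncated polynomial product: coefficients 0..n_terms-1 of a(L) * b(L)."""
    out = [0.0] * n_terms
    for i, ai in enumerate(a):
        if ai == 0.0:
            continue
        for j, bj in enumerate(b):
            if i + j >= n_terms:
                break
            out[i + j] += ai * bj
    return out


def average_gain(theta, lam):
    """Closed-form Gamma_k from Lemma 1."""
    return 1.0 + 2 * theta * lam + 2 * theta ** 2 * lam ** 2 / (2 - lam)


def demand_variance_amplification(tiers, n_terms=300):
    """Sum of squared coefficients of prod_r H_r: Var(q_bar_k) / sigma_D^2 for i.i.d. demand."""
    composite = [1.0] + [0.0] * (n_terms - 1)
    for theta, lam in tiers:
        composite = convolve(composite, impulse_response(theta, lam, n_terms), n_terms)
    return sum(c * c for c in composite)


tiers = [(1.0, 0.5)] * 4  # four identical tiers, as in the four-echelon Beer Game
print(demand_variance_amplification(tiers), average_gain(1.0, 0.5) ** 4)
```

With these values $\Gamma = 7/3$, so the lower bound after four tiers is $\Gamma^4 = 2401/81 \approx 29.6$ times the demand variance; the exact amplification printed first is at least that large.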
###### Proof of Theorem[2](https://arxiv.org/html/2605.17036#Thmtheorem2)\.
Fix a demand pathD=dD=d\. Define the centered decision\-driven deviation
xk,t\(d\)=qk,t−𝔼ϵ\[qk,t∣D=d\]\.x\_\{k,t\}\(d\)=q\_\{k,t\}\-\\mathbb\{E\}\_\{\\epsilon\}\[q\_\{k,t\}\\mid D=d\]\.Subtracting the demand\-channel recursion from the full transfer representation gives
𝒙k=Hk\(ℒ\)𝒙k−1\+G\(ℒ\)ϵk,𝒙0=0\.\{\\bm\{x\}\}\_\{k\}=H\_\{k\}\(\\mathcal\{L\}\)\{\\bm\{x\}\}\_\{k\-1\}\+G\(\\mathcal\{L\}\)\{\\bm\{\\epsilon\}\}\_\{k\},\\qquad\{\\bm\{x\}\}\_\{0\}=0\.Iterating this recursion gives
𝒙k=∑j=1k\(∏r=j\+1kHr\(ℒ\)\)G\(ℒ\)ϵj,\{\\bm\{x\}\}\_\{k\}=\\sum\_\{j=1\}^\{k\}\\left\(\\prod\_\{r=j\+1\}^\{k\}H\_\{r\}\(\\mathcal\{L\}\)\\right\)G\(\\mathcal\{L\}\)\{\\bm\{\\epsilon\}\}\_\{j\},where the product is interpreted as the identity operator whenj=kj=k\.
Define
Aj,k\(ℒ\):=\(∏r=j\+1kHr\(ℒ\)\)G\(ℒ\)\.A\_\{j,k\}\(\\mathcal\{L\}\):=\\left\(\\prod\_\{r=j\+1\}^\{k\}H\_\{r\}\(\\mathcal\{L\}\)\\right\)G\(\\mathcal\{L\}\)\.Then
𝒙k=∑j=1kAj,k\(ℒ\)ϵj\.\{\\bm\{x\}\}\_\{k\}=\\sum\_\{j=1\}^\{k\}A\_\{j,k\}\(\\mathcal\{L\}\)\{\\bm\{\\epsilon\}\}\_\{j\}\.
Because the shock processes are independent across tiers, the summands are mutually independent\. Therefore
Vkϵ=Varϵ\(xk,t∣D=d\)=∑j=1kVarϵ\(\[Aj,k\(ℒ\)ϵj\]t\)\.V^\{\\epsilon\}\_\{k\}=\\operatorname\{Var\}\_\{\\epsilon\}\(x\_\{k,t\}\\mid D=d\)=\\sum\_\{j=1\}^\{k\}\\operatorname\{Var\}\_\{\\epsilon\}\\left\(\[A\_\{j,k\}\(\\mathcal\{L\}\)\{\\bm\{\\epsilon\}\}\_\{j\}\]\_\{t\}\\right\)\.
For eachjj, since\{ϵj,t\}\\\{\\epsilon\_\{j,t\}\\\}is centered and independent over time with variance lower bounded byσj2\\sigma\_\{j\}^\{2\}, the variance of the filtered process is
Varϵ\(\[Aj,k\(ℒ\)ϵj\]t\)≥σj22π∫−ππ\|Aj,k\(e−iω\)\|2𝑑ω\.\\operatorname\{Var\}\_\{\\epsilon\}\\left\(\[A\_\{j,k\}\(\\mathcal\{L\}\)\{\\bm\{\\epsilon\}\}\_\{j\}\]\_\{t\}\\right\)\\geq\\frac\{\\sigma\_\{j\}^\{2\}\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}\|A\_\{j,k\}\(e^\{\-i\\omega\}\)\|^\{2\}\\,d\\omega\.Moreover,
\|Aj,k\(e−iω\)\|2=\|G\(e−iω\)\|2∏r=j\+1k\|Hr\(e−iω\)\|2=b\(ω\)∏r=j\+1kgr\(ω\)\.\|A\_\{j,k\}\(e^\{\-i\\omega\}\)\|^\{2\}=\|G\(e^\{\-i\\omega\}\)\|^\{2\}\\prod\_\{r=j\+1\}^\{k\}\|H\_\{r\}\(e^\{\-i\\omega\}\)\|^\{2\}=b\(\\omega\)\\prod\_\{r=j\+1\}^\{k\}g\_\{r\}\(\\omega\)\.Thus
Vkϵ≥12π∫−ππb\(ω\)∑j=1kσj2∏r=j\+1kgr\(ω\)dω\.V^\{\\epsilon\}\_\{k\}\\geq\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}b\(\\omega\)\\sum\_\{j=1\}^\{k\}\\sigma\_\{j\}^\{2\}\\prod\_\{r=j\+1\}^\{k\}g\_\{r\}\(\\omega\)\\,d\\omega\.
Next,b\(ω\)b\(\\omega\)and eachgr\(ω\)g\_\{r\}\(\\omega\)are nonnegative nondecreasing functions of1−cosω1\-\\cos\\omega\. Hence the product\-gain lower\-bound lemma gives
12π∫−ππb\(ω\)∏r=j\+1kgr\(ω\)dω≥\(12π∫−ππb\(ω\)𝑑ω\)∏r=j\+1k\(12π∫−ππgr\(ω\)𝑑ω\)\.\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}b\(\\omega\)\\prod\_\{r=j\+1\}^\{k\}g\_\{r\}\(\\omega\)\\,d\\omega\\geq\\left\(\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}b\(\\omega\)\\,d\\omega\\right\)\\prod\_\{r=j\+1\}^\{k\}\\left\(\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}g\_\{r\}\(\\omega\)\\,d\\omega\\right\)\.Since
12π∫−ππb\(ω\)𝑑ω=12π∫−ππ4sin2\(ω/2\)𝑑ω=2,\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}b\(\\omega\)\\,d\\omega=\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}4\\sin^\{2\}\(\\omega/2\)\\,d\\omega=2,and since
Γr=12π∫−ππgr\(ω\)𝑑ω,\\Gamma\_\{r\}=\\frac\{1\}\{2\\pi\}\\int\_\{\-\\pi\}^\{\\pi\}g\_\{r\}\(\\omega\)\\,d\\omega,we obtain
Vkϵ≥2∑j=1kσj2∏r=j\+1kΓr\.V^\{\\epsilon\}\_\{k\}\\geq 2\\sum\_\{j=1\}^\{k\}\\sigma\_\{j\}^\{2\}\\prod\_\{r=j\+1\}^\{k\}\\Gamma\_\{r\}\.
Finally, if
$$\theta_r\geq\theta>0,\qquad\lambda_r\in[\lambda,1],$$
then Lemma [1](https://arxiv.org/html/2605.17036#Thmlemma1) gives $\Gamma_r\geq\Gamma>1$ for every $r$. Hence
$$\prod_{r=j+1}^{k}\Gamma_r\geq\Gamma^{k-j}.$$
Therefore
$$V^{\epsilon}_k\geq 2\sum_{j=1}^{k}\sigma_j^2\,\Gamma^{k-j}.$$
This proves ([24](https://arxiv.org/html/2605.17036#S5.E24)). ∎
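The two averaging steps in the proof above can be checked numerically. The following is a minimal sketch, not the paper's code: it assumes hypothetical tier gains of the form $g_r(\omega)=1+c_r(1-\cos\omega)$ (the actual $g_r$ are defined elsewhere in the paper), and the slope values and the helper names `mean_over_circle` and `geometric_lower_bound` are illustrative only.

```python
import numpy as np

# Uniform grid on [-pi, pi). For periodic integrands, the grid mean
# approximates (1/2pi) * integral over one period.
omega = np.linspace(-np.pi, np.pi, 4096, endpoint=False)


def mean_over_circle(values):
    """Approximate (1/2pi) * integral_{-pi}^{pi} values(omega) d omega."""
    return float(np.mean(values))


# b(omega) = 4 sin^2(omega / 2), as in the proof.
b = 4.0 * np.sin(omega / 2.0) ** 2

# ASSUMPTION: hypothetical gains, nonnegative and nondecreasing in
# u = 1 - cos(omega). The slopes are illustrative, not from the paper.
u = 1.0 - np.cos(omega)
slopes = [0.3, 0.5, 0.8]
gains = [1.0 + c * u for c in slopes]

mean_b = mean_over_circle(b)                   # should be close to 2
gammas = [mean_over_circle(g) for g in gains]  # plays the role of Gamma_r

# Product-gain inequality: mean of the product >= product of the means.
lhs = mean_over_circle(b * np.prod(gains, axis=0))
rhs = mean_b * float(np.prod(gammas))


def geometric_lower_bound(sigma_sq, gamma):
    """Evaluate 2 * sum_{j=1}^{k} sigma_j^2 * gamma^(k - j), k = len(sigma_sq)."""
    k = len(sigma_sq)
    return 2.0 * sum(s * gamma ** (k - j) for j, s in enumerate(sigma_sq, start=1))
```

Under these assumed gains, `lhs` exceeds `rhs`, and `geometric_lower_bound` grows with the number of tiers whenever `gamma > 1`, which is the geometric amplification the bound expresses.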
###### Proof of Proposition [3](https://arxiv.org/html/2605.17036#Thmproposition3).
Conditional on $D=d$, define
$$x_{k,t}(d):=q_{k,t}-\mathbb{E}[q_{k,t}\mid D=d].$$
The deterministic demand path affects only the conditional mean, so
$$W_{k,t}(d)=\operatorname{Var}(x_{k,t}(d)).$$
The centered linear recursion is
$$x_{k,t}(d)=H_k(\mathcal{L})x_{k-1,t}(d)+G(\mathcal{L})\epsilon_{k,t},\qquad x_{0,t}(d)=0.$$
Iterating this recursion gives
$$x_{k,t}(d)=\sum_{j=1}^{k}B_{k,j}(\mathcal{L})\epsilon_{j,t},$$
where
$$B_{k,j}(\mathcal{L}):=\left(\prod_{h=j+1}^{k}H_h(\mathcal{L})\right)G(\mathcal{L}),$$
with the empty product equal to one. Write
$$B_{k,j}(\mathcal{L})=\sum_{m=0}^{\infty}b_{k,j,m}\mathcal{L}^{m}.$$
Because the system starts from deterministic initial conditions and the shock pre-history is zero, only shocks from the first $t$ periods contribute to the period-$t$ deviation. Thus
$$x_{k,t}(d)=\sum_{j=1}^{k}\sum_{m=0}^{t-1}b_{k,j,m}\epsilon_{j,t-m}.$$
Independence across tiers and time eliminates all covariance terms, so
$$W_{k,t}(d)=\sum_{j=1}^{k}\sigma_j^2\sum_{m=0}^{t-1}b_{k,j,m}^{2}.$$
Similarly,
$$W_{k,t+1}(d)=\sum_{j=1}^{k}\sigma_j^2\sum_{m=0}^{t}b_{k,j,m}^{2}.$$
Subtracting yields
$$W_{k,t+1}(d)-W_{k,t}(d)=\sum_{j=1}^{k}\sigma_j^2\,b_{k,j,t}^{2}\geq 0.$$
This proves finite-horizon intertemporal accumulation of decision unreliability. ∎
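The accumulation argument above can be illustrated with finite lag polynomials. The sketch below is an assumption-laden example rather than the paper's model: it takes $G(\mathcal{L})=1-\mathcal{L}$ (one filter whose squared gain is $4\sin^2(\omega/2)$) and a hypothetical tier filter $H(\mathcal{L})=(1+\theta)-\theta\mathcal{L}$ with $\theta=0.5$; the function names are illustrative.

```python
import numpy as np


def impulse_coefficients(h_filters, g_filter, horizon):
    """Return b_0..b_{horizon-1} of B(L) = (prod_h H_h(L)) G(L).

    Each filter is a sequence of lag-polynomial coefficients, lowest lag first.
    Multiplying lag polynomials is a convolution of their coefficients.
    """
    coeffs = np.asarray(g_filter, dtype=float)
    for h in h_filters:
        coeffs = np.convolve(coeffs, np.asarray(h, dtype=float))
    padded = np.zeros(horizon)
    n = min(horizon, len(coeffs))
    padded[:n] = coeffs[:n]
    return padded


def conditional_variance_path(sigma_sq, b_by_source):
    """W_t for t = 1..T: sum_j sigma_j^2 * sum_{m < t} b_{j,m}^2."""
    total = np.zeros(b_by_source[0].shape[0])
    for s, b in zip(sigma_sq, b_by_source):
        total += s * np.cumsum(b ** 2)
    return total


# ASSUMPTIONS: illustrative filters and parameters, not taken from the paper.
G = [1.0, -1.0]              # G(L) = 1 - L
theta = 0.5
H = [1.0 + theta, -theta]    # hypothetical H(L) = (1 + theta) - theta * L
T, k = 8, 3
sigma_sq = [1.0, 1.0, 1.0]

# Shocks from tier j pass through tiers j+1..k, i.e. (k - j) copies of H.
b_by_source = [impulse_coefficients([H] * (k - j), G, T) for j in range(1, k + 1)]
W = conditional_variance_path(sigma_sq, b_by_source)   # W[t-1] holds W_{k,t}
increments = np.diff(W)                                # W_{k,t+1} - W_{k,t}
```

Each entry of `increments` equals $\sum_j \sigma_j^2 b_{k,j,t}^2$ and is therefore nonnegative. With finite filters like these, the path flattens once all coefficients have entered the sum.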
### B.1 Simulation Under Operational Constraints
The analytical results in Section [5](https://arxiv.org/html/2605.17036#S5) are derived under the linear benchmark system. In practice, supply chains operate under nonlinear constraints that are deliberately abstracted away in the benchmark model. These include:
- nonnegative order quantities,
- state-dependent decision shocks,
- adjusted safety stocks.
The full simulation model therefore uses the operational dynamics from Section [5.1](https://arxiv.org/html/2605.17036#S5.SS1). In particular, simulations use
$$q_{k,t}=\left[\theta_k\hat{q}_{k,t}+\epsilon_{k,t}-IP_{k,t}\right]^{+}.$$
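The truncated order rule above can be written as a one-line function. This is a minimal sketch: the function and argument names are illustrative, and the inputs (forecast $\hat{q}_{k,t}$, shock $\epsilon_{k,t}$, inventory position $IP_{k,t}$) are assumed to be computed elsewhere, as in Section 5.1.

```python
def order_quantity(theta_k, q_hat, eps, inventory_position):
    """Nonnegative order rule: [theta_k * q_hat + eps - IP]^+.

    The positive part clips the linear benchmark order at zero, so a facility
    cannot place a negative order when its inventory position is high.
    """
    return max(0.0, theta_k * q_hat + eps - inventory_position)


# Illustrative values only.
unconstrained_case = order_quantity(1.0, 10.0, 2.0, 4.0)   # linear rule is positive
clipped_case = order_quantity(1.0, 3.0, -1.0, 5.0)         # linear rule is negative
```

The clipping is what makes the simulated system nonlinear: whenever the unclipped expression is negative, the order is zero and the linear recursion used in the analysis no longer applies exactly.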
The simulations serve two purposes. First, they test whether the amplification mechanism identified in the linear benchmark persists under nonlinear operational constraints. Second, they quantify how nonnegative ordering, backlog dynamics, and shipment constraints modify the magnitude of the bullwhip effect.