The Price of Anarchy in Disaggregated Inference
Summary
This paper presents a game-theoretic analysis of disaggregated inference architectures, which separate the prefill and decode phases across distinct GPU pools, and characterizes how GPU saturation shifts the system between performance regimes. The authors propose an adaptive controller that detects saturation transitions and adjusts routing parameters. In experiments on a 3-node NVIDIA B200 cluster, it lowers the empirical Price of Anarchy estimate (PoA-hat) by up to 3.1x in the saturated phase, at a 13% throughput cost.
Source: https://huggingface.co/papers/2606.17081
Abstract
Disaggregated inference architectures separate the prefill and decode phases across distinct GPU pools. A game-theoretic analysis characterizes how GPU saturation changes system performance through regime transitions and shifts in payoff structure, and it motivates an adaptive controller that adjusts routing to reduce latency.
Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing “agents” that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game’s payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
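The switch the abstract describes, from cache-affinity routing below saturation to load-balanced routing at saturation, can be illustrated with a small Python sketch. This is not the paper's controller and not Dynamo's API: the class names, the use of request concurrency as the saturation signal, the hysteresis band, and the tie-breaking rules are all assumptions made for illustration. The default threshold of 128 only echoes the first post-knee grid point (C=128) reported above; the abstract does not say how saturation is actually detected. The helper at the end simply rechecks the reported 66.4 to 21.5 reduction.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of one decode worker (not a Dynamo type)."""
    name: str
    queue_depth: int        # requests currently queued on this worker
    cache_hit_ratio: float  # share of the prompt's KV blocks already cached, in [0, 1]


class AdaptiveRouter:
    """Toy saturation-aware router; every parameter here is an assumption."""

    def __init__(self, knee: int = 128, hysteresis: int = 16) -> None:
        self.knee = knee              # concurrency at which we call the system saturated
        self.hysteresis = hysteresis  # band that prevents flapping around the knee
        self.saturated = False

    def update(self, concurrency: int) -> bool:
        """Update the regime flag from the observed concurrency."""
        if not self.saturated and concurrency >= self.knee:
            self.saturated = True
        elif self.saturated and concurrency < self.knee - self.hysteresis:
            self.saturated = False
        return self.saturated

    def pick(self, workers: list[Worker]) -> Worker:
        """Choose a worker according to the current regime."""
        if self.saturated:
            # Congestion avoidance: shortest queue first, cache reuse breaks ties.
            return min(workers, key=lambda w: (w.queue_depth, -w.cache_hit_ratio))
        # Cache-affinity exploitation: best cache reuse first, queue breaks ties.
        return max(workers, key=lambda w: (w.cache_hit_ratio, -w.queue_depth))


def improvement_factor(before: float, after: float) -> float:
    """Ratio by which a cost-like metric (such as PoA-hat) dropped."""
    return before / after


if __name__ == "__main__":
    router = AdaptiveRouter()
    pool = [Worker("warm-but-busy", 10, 0.9), Worker("cold-but-idle", 2, 0.1)]

    router.update(64)                 # below the assumed knee
    print(router.pick(pool).name)     # warm-but-busy

    router.update(128)                # at the assumed knee
    print(router.pick(pool).name)     # cold-but-idle

    print(round(improvement_factor(66.4, 21.5), 1))  # 3.1
```

The hysteresis band is a design choice of this sketch: without it, a load hovering around the threshold would flip the router between policies on every update. A real controller would more likely derive its saturation signal from latency or queueing measurements than from a fixed concurrency cutoff.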
Similar Articles
The Inference Shift (8 minute read)
This article analyzes Cerebras' upcoming IPO as a signal of the 'inference shift' in AI hardware, arguing that while Nvidia dominates GPU-based training, the future of AI compute is becoming increasingly heterogeneous to support inference workloads.
AI economics part 2 (11 minute read)
The article analyzes the economics of AI, focusing on the war for GPU resources, contrasts human inference spikes with agentic continuous workloads, and argues that current infrastructure is optimized for human usage, not agentic inference, which is more demanding.
@kazukifujii: Sakura Internet's Michishita-san's article comprehensively summarizes LLM Inference and comes highly recommended. It fe…
This article summarizes a presentation by Junda Chen on disaggregated inference for LLMs, explaining why goodput (throughput meeting latency SLOs) matters more than raw throughput, and how separating prefill and decode phases improves performance. It also highlights the influence on NVIDIA Dynamo.
@robertnishihara: Some intuition about PD disaggregation from the blog - PD doesn't speed up prefill and can actually hurt TTFT - PD's re…
This blog post from Anyscale explains the intuition behind Prefill-Decode (PD) disaggregation for LLM serving, showing how separating prefill and decode phases onto dedicated GPUs can achieve up to 2.7x better goodput and 67% cost savings when using Ray and vLLM on AMD MI325X, while also discussing when PD disaggregation does not help.
Inference cost at scale with napkin math (13 minute read)
A technical walkthrough that shows how to estimate the cost of serving AI models at scale using simple napkin math, covering GPU bandwidth, matrix multiplication, token pricing, and user capacity.