The Price of Anarchy in Disaggregated Inference

Hugging Face Daily Papers

Summary

This paper presents a game-theoretic analysis of disaggregated inference architectures, which separate the prefill and decode phases across GPU pools, and characterizes how GPU saturation changes system behavior. The authors propose an adaptive controller that detects saturation transitions and adjusts routing parameters. In experiments on an NVIDIA B200 cluster, it lowers their empirical Price of Anarchy estimate (PoA-hat) by up to 3.1x in the saturated phase, at a 13% throughput cost.

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
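
To make the Price of Anarchy concrete, the sketch below computes it for Pigou's textbook two-link routing network, where one link has constant latency 1 and the other has latency x**p for a traffic share x. This is an illustrative stand-in, not the paper's model and not its PoA-hat estimator (which is defined in Section 6.4 of the paper and not reproduced here). The function name `pigou_poa` and the exponent `p` are assumptions introduced for this example. It shows the qualitative effect the abstract describes: as latency becomes more superlinear in load, the gap between selfish and optimal routing widens.

```python
def pigou_poa(p: float) -> float:
    """Price of Anarchy for Pigou's network with link latencies 1 and x**p.

    One unit of traffic is split between a constant-latency link (cost 1)
    and a congestible link whose latency is x**p at traffic share x.
    """
    if p <= 0:
        raise ValueError("p must be positive")

    # Selfish equilibrium: x**p <= 1 for every x in [0, 1], so all traffic
    # takes the congestible link and everyone experiences latency 1.
    nash_cost = 1.0

    # Social optimum: minimize (1 - x) * 1 + x * x**p over x in [0, 1].
    # Setting the derivative to zero gives x = (p + 1) ** (-1 / p).
    x_opt = (p + 1.0) ** (-1.0 / p)
    opt_cost = (1.0 - x_opt) + x_opt ** (p + 1.0)

    return nash_cost / opt_cost


if __name__ == "__main__":
    for p in (1, 2, 4, 8):
        print(f"p={p}: PoA={pigou_poa(p):.3f}")
```

With linear latency (p = 1) the ratio is the classic 4/3, and it grows as p increases, which loosely mirrors the bounded-below-saturation versus rising-at-saturation behavior reported for PoA-hat.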

Cached at: 06/17/26, 03:52 PM

Source: https://huggingface.co/papers/2606.17081

Abstract

Disaggregated inference architectures separate the prefill and decode phases across distinct GPU pools. A game-theoretic analysis characterizes how GPU saturation induces regime transitions that change the payoff structure, which motivates an adaptive controller that adjusts routing parameters and reduces latency.
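
One way to picture a controller of this kind is a saturation detector with hysteresis that reweights a routing score. The sketch below is a minimal illustration under stated assumptions, not the authors' controller and not NVIDIA Dynamo's API: the class `SaturationController`, the function `pick_worker`, the utilization signal, the thresholds, and the weights are all hypothetical. Below saturation the router favors cache overlap; once the smoothed utilization crosses the upper threshold, it favors lightly loaded workers instead.

```python
from dataclasses import dataclass


@dataclass
class SaturationController:
    enter_at: float = 0.90  # hypothetical threshold to declare saturation
    exit_at: float = 0.75   # lower threshold to leave it (hysteresis)
    alpha: float = 0.3      # EWMA smoothing factor for the load signal
    smoothed: float = 0.0
    saturated: bool = False

    def observe(self, utilization: float) -> bool:
        """Update the smoothed signal and return the current regime."""
        self.smoothed = self.alpha * utilization + (1 - self.alpha) * self.smoothed
        if not self.saturated and self.smoothed >= self.enter_at:
            self.saturated = True
        elif self.saturated and self.smoothed <= self.exit_at:
            self.saturated = False
        return self.saturated

    def cache_weight(self) -> float:
        """Weight on cache affinity; the remainder goes to load avoidance."""
        return 0.2 if self.saturated else 1.0


def pick_worker(workers, ctrl):
    """Pick a worker from (name, cache_overlap, load) tuples, each in [0, 1]."""
    w = ctrl.cache_weight()
    return max(workers, key=lambda t: w * t[1] - (1.0 - w) * t[2])[0]


if __name__ == "__main__":
    ctrl = SaturationController()
    workers = [("a", 0.9, 0.95), ("b", 0.1, 0.2)]
    print(pick_worker(workers, ctrl))  # unsaturated: cache affinity dominates
    for _ in range(20):
        ctrl.observe(1.0)              # sustained high utilization
    print(pick_worker(workers, ctrl))  # saturated: load avoidance dominates
```

The two thresholds keep the controller from flapping between regimes when the load signal hovers near a single cutoff; a real system would likely use a richer signal, such as queue depth or tail latency, than the single utilization number assumed here.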
