Tag
This paper presents a game-theoretic analysis of disaggregated inference architectures that separate prefill and decode phases across GPU pools, characterizing how GPU saturation affects performance. The authors propose an adaptive controller that detects saturation transitions and adjusts routing parameters, reducing the Price of Anarchy significantly in experiments on NVIDIA B200 clusters.