This paper presents ZCube, a novel network architecture developed by Z.ai, Harnets.AI, and Tsinghua University to address topology-induced congestion in Prefill-Decode disaggregated LLM inference clusters. In production deployments on GLM-5.1 coding workloads, ZCube achieved a 33% reduction in network CapEx, a 15% throughput improvement, and a 40.6% reduction in P99 time-to-first-token (TTFT) latency.