@Zai_org: https://x.com/Zai_org/status/2057216685040443743
Summary
This paper presents ZCube, a novel network architecture developed by Z.ai, Harnets.AI, and Tsinghua University to address topology-induced congestion in Prefill-Decode disaggregated LLM inference clusters. Production deployments on GLM-5.1 coding workloads achieved a 33% reduction in network CapEx, 15% throughput improvement, and 40.6% reduction in TTFT P99 latency.
Next-generation LLM Inference Network: How ZCube Alleviates Network Bottlenecks
LLM inference is reshaping AI infrastructure. The network used to be the least interesting part of an inference cluster. That isn’t true anymore. With long-context inference and Prefill-Decode disaggregation now standard, the network sits on the critical path of throughput, tail latency, and per-token serving cost.
To address the increasingly severe topology-induced congestion in Prefill-Decode disaggregated deployments, Z.ai, Harnets.AI, and Tsinghua University jointly developed and deployed the ZCube network architecture in an online production environment. The deployment shows that system-level innovation at the network architecture layer can unlock hardware potential in a highly cost-effective way.
In production benchmarking for the GLM-5.1 coding workload, ZCube delivered significant gains through architectural optimization alone:
- Cost optimization: GPUs, the software stack, and applications remained unchanged, while switch and optical module CapEx was reduced by 33%.
- Throughput improvement: Average GPU inference throughput increased by 15%.
- Latency improvement: TTFT P99 was reduced by 40.6%.
The root cause of the congestion lies in the shift of inference traffic patterns. As PD disaggregation becomes mainstream, cross-node KV Cache transfers make inference traffic highly asymmetric, with dynamically changing sources, destinations, and traffic volumes. In traditional ROFT (Rail-Optimized Fat-Tree) architectures, static topology and port mappings can easily concentrate traffic on a limited set of switches and links, causing local hotspots, queue buildup, and PFC backpressure. This leads to a structural issue where aggregate bandwidth appears sufficient, yet localized congestion occurs frequently.
ZCube addresses this issue by using a fully flattened network topology together with a hybrid single-rail / multi-rail access design. At the network architecture layer, it decouples and distributes PD traffic across a broader path space, reducing the probability of topology-induced congestion at its source. This provides a more efficient networking foundation for next-generation hyperscale inference clusters.
Figure 1: ZCube Effectively Avoids Topology-Induced Network Congestion Compared with ROFT
Network Becoming a Bottleneck for Effective Inference
When thousands of GPUs serve online inference requests concurrently, every KV Cache transfer and every data synchronization operation traverses the inter-GPU network. As long-context inference and Prefill-Decode disaggregated inference gradually become mainstream, data exchange between Prefill and Decode nodes continues to grow. Network bandwidth, and more importantly the ability to use it effectively, has begun to affect cluster-level throughput and latency directly.
To quantify the impact of networking on inference performance, we first conducted an ablation study on a 512-GPU cluster. We kept GPU compute, the software stack, the model, and application logic unchanged, and only adjusted the available NIC bandwidth cap. We then measured changes in overall cluster throughput and Time to First Token (TTFT).
Figure 2: Overall cluster throughput and TTFT under different network bandwidth settings
For example, when the NIC bandwidth cap was raised from 100Gbps to 200Gbps, overall inference throughput improved by approximately 19% and TTFT decreased by approximately 22%. This indicates that, in LLM inference, network bandwidth has become one of the key factors constraining service performance.
1. Network Congestion in Inference
Today, AI clusters commonly use Clos, or Fat-Tree, architectures. The basic idea is to scale the network by stacking multiple layers of switches. However, the performance of Clos networks depends heavily on ideal load balancing across switches, which is difficult to achieve in practice due to routing policies and real traffic patterns.
For example, in many two-tier Fat-Tree deployments, which consist of Spine and Leaf layers, traffic across Spine switches can become severely imbalanced. As a result, upper-layer applications often fail to obtain the expected network performance.
To reduce the overhead of cross-layer forwarding, the industry often adopts ROFT (Rail-Optimized Fat-Tree) architectures [1]. As shown in Figure 3, ROFT groups GPUs by index (“rail”), and connects GPUs with the same index to the same Leaf switch, reducing the communication cost across Spine switches.
Figure 3: In ROFT, traffic among Leaf switches can easily become imbalanced
ROFT works well for certain training traffic patterns. However, in Prefill-Decode disaggregated inference, we observed a more prominent issue: KV Cache transfers exhibit strong source-destination asymmetry. Different GPUs and different NICs carry highly uneven communication loads, as shown in Figure 4. As a result, ROFT’s rail mapping no longer naturally translates into load balancing. Instead, traffic can become concentrated on a small number of Leaf switches and links, leading to link congestion and degraded transfer performance.
This manifests in several ways:
- Some Leaf switches become persistent load hotspots, increasing the probability that multiple KV Cache transfer flows compete on the same links. As a result, actual transfer throughput can fall far below the NIC bandwidth capacity.
- Certain egress queues on some Leaf switches remain at high depth for extended periods and frequently trigger PFC backpressure, as shown in Figure 5.
- Link congestion further amplifies tail latency, affecting both TTFT and overall throughput.
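A toy calculation can make the first symptom concrete. The sketch below assumes, as a simplification, that NIC r of every server attaches to Leaf r, and that the per-NIC loads are the made-up numbers shown. The names `roft_leaf_loads` and `imbalance` are illustrative and do not come from any tool described in this article.

```python
from collections import defaultdict


def roft_leaf_loads(nic_load_by_server):
    """Sum per-NIC load onto Leaf switches under a rail mapping.

    nic_load_by_server[s][r] is the load (arbitrary units) on NIC r of server s.
    Assumption: NIC r of every server attaches to Leaf r.
    """
    loads = defaultdict(float)
    for server in nic_load_by_server:
        for rail, load in enumerate(server):
            loads[rail] += load
    return dict(loads)


def imbalance(loads):
    """Ratio of the busiest switch to the mean load; 1.0 means perfectly even."""
    values = list(loads.values())
    return max(values) / (sum(values) / len(values))


# Hypothetical KV Cache load for 4 servers x 8 NICs, skewed toward NIC 0 and NIC 1.
skewed = [[8, 6, 1, 1, 0, 0, 0, 0] for _ in range(4)]
leaf = roft_leaf_loads(skewed)
print(leaf[0], leaf[1], leaf[7])  # load landing on Leaf 0, Leaf 1, Leaf 7
print(imbalance(leaf))            # busiest Leaf relative to the mean
```

In this toy model the skew across NICs on each machine carries over directly to the Leaf switches, because the rail mapping sends the same NIC index on every server to the same Leaf.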
Figure 4: KV Cache transfer load imbalance across different NICs on the same machine
Figure 5: Frequent PFC Pause events on selected Leaf switch ports
It is important to distinguish between two types of network congestion, as illustrated in Figure 6:
- Unavoidable congestion: For example, when multiple GPUs send data to the same destination at the same time, contention on the final-hop link is inevitable.
- Avoidable congestion: This is caused by topology design, traffic mapping, or imbalanced multipath utilization. Fundamentally, it is an architecture-level design problem.
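The distinction above can be sketched as a simple rule over flow paths. This is only an illustration under a simplifying assumption: a shared link counts as unavoidable contention when it is the final hop for every flow using it, and as avoidable otherwise. The link names and the function `classify_congestion` are hypothetical.

```python
from collections import defaultdict


def classify_congestion(flows):
    """Label every link shared by two or more flows.

    flows: list of (flow_id, path), where path is an ordered list of link names
    and the last entry is the final-hop link into the destination.
    Returns {link: "unavoidable" | "avoidable"}.
    """
    users = defaultdict(list)
    for flow_id, path in flows:
        for hop, link in enumerate(path):
            is_final_hop = hop == len(path) - 1
            users[link].append((flow_id, is_final_hop))

    labels = {}
    for link, entries in users.items():
        if len(entries) < 2:
            continue  # a link used by a single flow has no contention
        if all(is_final for _, is_final in entries):
            labels[link] = "unavoidable"  # same destination link for every flow
        else:
            labels[link] = "avoidable"    # contention inside the fabric
    return labels


# Hypothetical flows: f1/f2 collide on a Spine-to-Leaf link on their way to
# different GPUs, while f3/f4 both target the same destination GPU.
flows = [
    ("f1", ["leaf1->spine1", "spine1->leaf3", "leaf3->gpuA"]),
    ("f2", ["leaf2->spine1", "spine1->leaf3", "leaf3->gpuB"]),
    ("f3", ["leaf1->spine2", "spine2->leaf4", "leaf4->gpuC"]),
    ("f4", ["leaf2->spine3", "spine3->leaf4", "leaf4->gpuC"]),
]
print(classify_congestion(flows))
```

The f1/f2 collision is the kind a different topology or path mapping could remove, whereas the f3/f4 collision remains under any topology.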
Figure 6: Illustration of two types of network congestion
For the first type of congestion, we typically rely on congestion control, traffic shaping, and related mechanisms to mitigate its impact. For the second type, new network transport mechanisms such as adaptive routing [2], packet spraying [3,4], and MRC [5] can help. However, a more effective approach is to prevent network conflicts that should not occur in the first place through innovation at the network architecture layer.
Prefill-Decode disaggregated inference is a typical example. If the network topology cannot match the traffic pattern, the system will repeatedly generate load hotspots and link conflicts. Solving this problem requires rethinking the inference network architecture itself.
2. ZCube Network Architecture
To address the above issues, we deployed a new ZCube network architecture [6]. ZCube breaks away from the traditional Clos design philosophy of hierarchical switch stacking and instead introduces a fully flattened GPU server interconnect.
The ZCube routing strategy, designed specifically for the ZCube architecture, fully leverages the structural properties of the flattened topology. It can achieve near-ideal load balancing across all switches in the network, thereby significantly improving overall cluster network bandwidth.
Compared with Clos, ZCube has a natural advantage in load balancing. This advantage benefits both training clusters and inference clusters. Importantly, ZCube achieves these performance gains while reducing switch and optical module costs by approximately one third compared with Clos. Based on current mainstream switch and NIC configurations, ZCube can support flattened networking for tens of thousands, or even hundreds of thousands, of GPUs.
2.1 ZCube Core Architecture
As shown in Figure 7, the core ideas of ZCube are:
- Remove the Spine switch layer.
- Divide Leaf switches into two groups of equal size, typically odd-numbered switches and even-numbered switches.
- Establish a complete bipartite interconnect between the two switch groups.
- Connect the two ports of each GPU NIC to the corresponding switches in the two groups using single-rail and multi-rail access patterns.
Figure 7: ZCube architecture overview
Suppose each GPU has a corresponding NIC with two ports, i.e., p = 2. There are n GPUs in total, and GPUs and NICs share the same indices: 1, 2, …, n. Let k denote the number of GPUs connected to each switch. The total number of switches is 2n/k, numbered 1, 2, …, 2n/k. For GPU i, where 1 ≤ i ≤ n:
- The first port connects to the odd-numbered switch:
((i − 1) mod (n/k)) × 2 + 1
- The second port connects to the even-numbered switch:
⌈i/k⌉ × 2
The two switch groups are connected as a complete bipartite graph: every odd-numbered switch connects to every even-numbered switch.
A ZCube topology under dual-port NIC configuration, with p = 2, n = 32, and k = 8, is shown in Figure 7.
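The two port formulas and the bipartite interconnect can be written down directly. The following is a minimal sketch for the p = 2 case. The function names `zcube_ports` and `build_zcube` are illustrative, and the sketch assumes k divides n.

```python
def zcube_ports(i, n, k):
    """Return (odd_switch, even_switch) for GPU i (1-based), per the two formulas."""
    groups = n // k                       # switches per group
    odd = ((i - 1) % groups) * 2 + 1      # first port
    even = -(-i // k) * 2                 # second port; -(-i // k) is ceil(i / k)
    return odd, even


def build_zcube(n, k):
    """Build GPU attachments and the switch-to-switch fabric for p = 2."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    attach = {i: zcube_ports(i, n, k) for i in range(1, n + 1)}
    num_switches = 2 * n // k
    odd_group = range(1, num_switches + 1, 2)
    even_group = range(2, num_switches + 1, 2)
    # Complete bipartite graph: every odd switch links to every even switch.
    fabric = [(o, e) for o in odd_group for e in even_group]
    return attach, fabric


attach, fabric = build_zcube(n=32, k=8)
print(attach[1], attach[8], attach[9], attach[32])
print(len(fabric))
```

For n = 32 and k = 8 this gives 8 switches, with GPUs 1 to 8 sharing even switch 2 while spreading across odd switches 1, 3, 5, and 7.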
2.2 Key Properties of ZCube
Network Diameter
ZCube has a network diameter of two switch hops, meaning any pair of GPUs can reach each other through two switches. This sits between a one-layer switch network, which has one switch hop but limited scale, and a conventional two-layer switch network, which supports a larger scale but typically requires three switch hops and incurs higher latency.
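The two-hop diameter can be checked on a small instance with a breadth-first search. This sketch rebuilds the p = 2 topology from the port formulas in Section 2.1 and is only a verification aid; the function names are illustrative, and the routing strategy itself is not modeled.

```python
from collections import deque


def build_graph(n, k):
    """Adjacency sets for GPUs and switches, following the p = 2 port formulas."""
    groups = n // k
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for i in range(1, n + 1):
        link(("gpu", i), ("sw", ((i - 1) % groups) * 2 + 1))  # odd-group switch
        link(("gpu", i), ("sw", -(-i // k) * 2))              # even-group switch
    for odd in range(1, 2 * groups, 2):
        for even in range(2, 2 * groups + 1, 2):
            link(("sw", odd), ("sw", even))                   # complete bipartite
    return adj


def max_switch_hops(n, k):
    """Largest number of switches on a shortest path between any two GPUs."""
    adj = build_graph(n, k)
    worst = 0
    for src in range(1, n + 1):
        start = ("gpu", src)
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            if u[0] == "gpu" and u != start:
                continue  # GPUs do not forward traffic for other GPUs
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for dst in range(1, n + 1):
            if dst != src:
                # Edges on the path minus one equals switches traversed.
                worst = max(worst, dist[("gpu", dst)] - 1)
    return worst


print(max_switch_hops(32, 8))
```

GPU pairs that share a switch need one hop; every other pair needs two, because the odd switch of one GPU is always directly linked to the even switch of the other.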
Load Balancing
First, the ZCube routing strategy ensures that each GPU pair has a unique optimal path, avoiding traffic conflicts caused by multipath route selection.
Second, ZCube uses two complementary GPU-to-switch connection patterns. One switch group connects to GPUs in a single-rail pattern, where each switch connects to a contiguous range of GPU IDs. The other switch group connects to GPUs in a multi-rail pattern, where each switch connects to GPUs with the same relative index across groups.
This design enables ZCube to achieve highly effective load balancing across the entire switch fabric under both typical AI training traffic patterns, such as AllReduce and All-to-All, and typical AI inference traffic patterns, where source-destination relationships are uncertain, and NIC loads can be highly imbalanced.
As a result, ZCube can avoid the second type of network congestion described earlier at the architecture layer. As shown in Figure 8, traffic flows that would conflict under ROFT can obtain dedicated network paths under ZCube, thereby avoiding congestion.
Figure 8: Load balancing under the ZCube architecture
Scalability
ZCube provides strong scalability while preserving its favorable performance characteristics. For example, using one layer of 51.2T switches, each with 128 × 400Gbps ports, ZCube can construct a network connecting 16,384 400Gbps NICs. If higher-capacity switches are used, or if the ZCube network is divided into more planes, the architecture can scale further to support interconnection among tens of thousands or even hundreds of thousands of GPUs.
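One way to reproduce the 16,384 figure is a short port-budget calculation. The sketch below is an interpretation, not a statement of the deployed configuration. It assumes each 400Gbps switch port is broken out into two 200Gbps ports (256 ports per switch), that each 400Gbps NIC exposes two 200Gbps ports, and that every switch splits its ports evenly between GPUs and the opposite switch group. The function `zcube_max_gpus` is a hypothetical name.

```python
def zcube_max_gpus(switch_ports):
    """Largest n for a single switch layer under an even port split.

    Each switch uses k ports toward GPUs and n/k ports toward the opposite
    group (one link per switch in that group), so k + n/k <= switch_ports.
    """
    k = switch_ports // 2          # GPU-facing ports per switch
    groups = switch_ports - k      # switches in the opposite group (n/k)
    return k * groups              # n = k * (n/k)


# Assumed breakout: 128 x 400Gbps ports used as 256 x 200Gbps ports.
print(zcube_max_gpus(256))
# Without breakout, 128 ports per switch would give a smaller fabric.
print(zcube_max_gpus(128))
```

Under these assumptions n grows with the square of half the switch radix, which is why higher-capacity switches or additional planes extend the reachable scale quickly.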
Cost
At the same cluster scale, ZCube can reduce switch and optical module costs by approximately one third compared with traditional Clos / ROFT architectures. For example, in a 10,000-GPU AI cluster, ZCube can save roughly 210 million RMB to 640 million RMB in network hardware investment. These characteristics show that ZCube can achieve better load balancing and performance while requiring lower network hardware cost.
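A rough port count suggests where a saving of about one third in switch hardware could come from. The sketch below is an assumption-laden illustration, not the authors' cost model. It compares total switch ports in a p = 2 ZCube against an assumed non-blocking two-tier Leaf-Spine baseline at the same port speed, and ignores pricing, optical module placement, and the RMB figures above. The values of n and k are illustrative, chosen so that k = n/k.

```python
from fractions import Fraction


def zcube_switch_ports(n, k):
    """Total switch ports: 2n/k switches, each with k GPU-facing ports
    and n/k ports toward the opposite group."""
    return (2 * n // k) * (k + n // k)


def clos_switch_ports(n, ports_per_nic=2):
    """Assumed baseline: in a non-blocking Leaf-Spine fabric every NIC port
    needs a Leaf downlink, a Leaf uplink, and a Spine port."""
    return 3 * n * ports_per_nic


n, k = 16384, 128  # illustrative values with k = n/k
saving = 1 - Fraction(zcube_switch_ports(n, k), clos_switch_ports(n))
print(zcube_switch_ports(n, k), clos_switch_ports(n))
print(saving)  # 1/3
```

In this count ZCube needs two switch ports per NIC port where the assumed baseline needs three, which matches the order of the reduction reported here.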
2.3 Real-World Cluster Testing: Boosting Inference Performance While Cutting Network Costs
We upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference services from the original ROFT to the ZCube architecture. Since the ZCube architecture eliminates the Spine-layer switches found in traditional Clos architectures, the legacy cabling patterns, IP addressing schemes, routing policies, and switch configuration methods established under the Clos framework could not be reused directly, necessitating a complete redesign tailored to ZCube.
To tackle these challenges, the Harnets.AI Network Team designed a comprehensive network solution centered on the ZCube architecture. They developed a suite of automation tools, including the ZCube Controller, a data center layout design tool, and a cabling correctness verification program. This enabled capabilities such as data center deployment planning, cabling validation, automated configuration generation, and batch deployment, effectively resolving numerous hurdles in ZCube deployment. This suite of tools was the critical factor enabling the successful transformation of a large-scale production cluster within an exceptionally tight timeframe.
Following the seamless network architecture migration, we conducted real-world testing on the ZCube architecture by running the GLM-5.1 coding inference services on this cluster. By comparing the cluster’s inference performance before and after the upgrade, we found that ZCube boosted the average GPU inference throughput by over 15% compared to the ROFT architecture (as shown in Figure 9), while dropping the P99 tail latency of TTFT by 40.6%.
Figure 9: Throughput and TTFT Comparison Between ZCube and ROFT Architectures in the Same Cluster
In summary, for GPU and server hardware of the same scale and configuration, and without modifying any applications, upgrading the networking architecture to ZCube allowed us to not only save 1/3 of the optical modules and switch hardware, but also enable the cluster to serve 15% more inference requests per second. Against the current backdrop of exploding inference workloads and severe shortage of compute resources, this approach proves to be highly pragmatic and valuable. Currently, this ZCube cluster has been running stably for over two weeks, playing a vital role in powering the GLM-5.1 coding inference services.
3. Conclusion
LLM inference is moving from point-wise optimization toward system-level co-design. The coupling between the network and the inference engine is becoming increasingly tight, making networking a critical component of the inference system. The production deployment of ZCube shows that network architecture innovation can directly unlock the effective capacity of inference systems. By better aligning the network architecture with KV Cache transfers and PD traffic patterns, ZCube reduces the probability of topology-induced congestion at the source, improving throughput and latency while enhancing cluster cost efficiency.
For next-generation LLM infrastructure, network design will evolve from general-purpose interconnects toward model-traffic-driven system co-design. Long-context inference, PD disaggregation, MoE, and integrated training-inference workloads are reshaping intra-cluster communication patterns, requiring network topology, communication libraries, and scheduling policies to be jointly optimized around real model traffic. We will continue developing novel AI network architectures for larger-scale inference and training clusters, upgrading the network from a foundational GPU connection layer into a core driver of token generation efficiency, system resilience, and cost-effectiveness.
Acknowledgements
ZCube was published at ACM SIGCOMM 2025, and was recognized as “significantly change the way we think about and understand networking.” This is the first large-scale deployment of the technology in a production inference cluster. We thank the Harnets.AI team for their professional support and close collaboration throughout this network architecture upgrade and optimization effort.
References
[1] NVIDIA. 2023. SuperPOD: Next Generation Scalable Infrastructure for AI Leadership. https://docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf
[2] NVIDIA. 2025. https://developer.nvidia.com/blog/accelerating-ai-storage-by-up-to-48-with-nvidia-spectrum-x-networking-platform-and-partners/
[3] Ultra Ethernet Consortium. Ultra Ethernet specification v1.0.1, 2025.
[4] Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, and Torsten Hoefler. REPS: Recycled entropy packet spraying for adaptive load balancing and failure mitigation, 2026.
[5] Araujo, J., Chow, A., Handley, M., Lewis, R., Paasch, C., Padhye, J., … & Sur, S. (2026). Resilient AI Supercomputer Networking using MRC and SRv6. arXiv preprint arXiv:2605.04333.
[6] Yan, Z., Li, D., Chen, L., Xiong, D., Gao, K., Zhang, Y., … & Lin, H. (2025, September). From ATOP to ZCube: Automated topology optimization pipeline and a highly cost-effective network topology for large model training. In Proceedings of the ACM SIGCOMM 2025 Conference (pp. 861-881).