@LinQingV: When exploring LLM inference chip architectures previously, I reviewed the architectures of the four major AI inference ASIC companies: Groq, SambaNova, Tenstorrent, and Cerebras. While the first three have different emphases, their underlying logic falls within the same framework: large on-chip SRAM + dataflow architecture + deterministic scheduling...

Summary

The article analyzes the AI inference ASIC architectures of Groq, SambaNova, Tenstorrent, and Cerebras, highlighting Cerebras's unique wafer-scale engine design. It discusses the benefits of deterministic latency and high bandwidth for LLM inference, while noting challenges like yield, cost, and KV cache bottlenecks.

When I previously explored LLM inference chip architectures, I reviewed the four major AI inference ASIC companies: Groq, SambaNova, Tenstorrent, and Cerebras. The first three have different emphases, but their underlying logic falls within the same framework: large on-chip SRAM + dataflow architecture + deterministic scheduling. Their core differences lie in NoC topology, memory hierarchy, and compiler abstractions.

Cerebras is the company that truly shocked me, and it is the first of the four to IPO. Its choice is an order of magnitude more aggressive than the other three: instead of making chips, it makes entire wafers. A single WSE-3 is a whole 21.5cm × 21.5cm wafer, with 900,000 PEs physically connected into a single continuous piece of silicon via scribe-line stitching. Cerebras and TSMC jointly customized this process: it turns the narrow strips normally reserved for wafer dicing into metal interconnects across reticles, so that all reticles are physically stitched into a single large chip. (Figure 2 shows the internal structure of a single WSE-3: the left half displays the reticle grid and scribe-line stitching of the entire wafer, while the right half zooms in on the microarchitecture of a single PE.)

The structure of a single PE is extremely simple: an 8-wide FP16 SIMD compute core directly connected to 48KB of local SRAM, with no cache hierarchy, so every data access is deterministic and single-cycle. Each PE also includes a 5-port router (N/S/E/W + loopback), so communication between adjacent PEs is single-cycle as well. Crucially, the mesh across reticle boundaries has the same physical parameters as the mesh inside a reticle, so the compiler and runtime never need to be aware of reticle boundaries.

From the perspective of LLM inference, this uniformity is highly valuable. The bottleneck in LLM inference lies in the decode phase.
For each token generated, the model weights must be read in full once while the compute workload stays small, which makes decode a typical memory-bound scenario. The core problem for GPU clusters at this stage is data movement: HBM bandwidth is limited, and communication between multiple cards must pass through four layers of interconnect (NVLink → NVSwitch → InfiniBand → Ethernet). Bandwidth and latency differ by several orders of magnitude across these layers, and the programming model has to handle the topological boundary of each layer explicitly.

Cerebras bypasses this issue entirely. The intra-wafer fabric bandwidth is 27 PB/s. Weights flow from the external MemoryX storage cluster into the wafer via SwarmX, then propagate between PEs in dataflow execution mode, with the same placement and routing algorithms running across the entire wafer. (Figure 1 illustrates this system-level architecture: from the MemoryX parameter storage cluster to the SwarmX interconnect fabric, down to up to 2048 CS-3 nodes at the bottom, making the dataflow direction for weight broadcasting and gradient reduction clear.)

With 900,000 PEs each carrying 48KB of SRAM, the total on-chip storage is approximately 42GB. Each PE's access to its local SRAM is deterministic and single-cycle, and PE-to-PE communication costs one cycle per hop, so latency is proportional to Manhattan distance. For inference, provided the weight streaming compiler can distribute weights effectively to the right PEs, the aggregate bandwidth of this 42GB of distributed on-chip SRAM far exceeds GPU HBM solutions, without the access uncertainty introduced by cache hierarchies or the overhead of cross-chip data movement.
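The arithmetic behind these claims can be sketched in a few lines. This is a back-of-the-envelope model, not Cerebras's performance model: the PE count, the 48KB per PE, and the one-cycle-per-hop rule come from the text, while the function names, the reading of "KB" as 1024 bytes, the 70B-parameter FP16 model, and the 3 TB/s memory bandwidth are illustrative assumptions.

```python
KIB = 1024  # assumption: "48KB" is read as 48 * 1024 bytes


def total_sram_bytes(num_pes: int, sram_per_pe_bytes: int) -> int:
    """Aggregate on-chip SRAM when every PE owns a private slice."""
    return num_pes * sram_per_pe_bytes


def hop_latency_cycles(src: tuple[int, int], dst: tuple[int, int],
                       cycles_per_hop: int = 1) -> int:
    """PE-to-PE latency on a 2D mesh: Manhattan distance times cycles per hop."""
    return (abs(src[0] - dst[0]) + abs(src[1] - dst[1])) * cycles_per_hop


def decode_tokens_per_s_bound(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-sequence decode speed when every token must read
    all weights once (ignores KV cache reads, batching, and compute time)."""
    return bandwidth_bytes_per_s / weight_bytes


# Figures from the text: 900,000 PEs with 48KB each.
sram = total_sram_bytes(900_000, 48 * KIB)
print(f"on-chip SRAM: {sram / 1e9:.1f} GB ({sram / 2**30:.1f} GiB)")

# Hypothetical PE coordinates: 3 columns and 4 rows apart.
print("hops (0,0) -> (3,4):", hop_latency_cycles((0, 0), (3, 4)))

# Hypothetical: 70e9 FP16 parameters behind an assumed 3 TB/s memory system.
bound = decode_tokens_per_s_bound(weight_bytes=70e9 * 2, bandwidth_bytes_per_s=3e12)
print(f"decode upper bound: {bound:.1f} tokens/s")
```

The product comes out at roughly 44 GB in decimal units or 41 GiB in binary units, the same range as the approximately 42GB quoted above. The decode bound shows why the phase is memory-bound: the token rate scales directly with the bandwidth at which weights can be read.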
Reflecting on my own experience, designing inference chip architectures took significant effort in balancing NoC topology against memory hierarchy, because chip boundaries are hard constraints and cross-chip communication always costs more than intra-chip communication. Cerebras's approach effectively eliminates this gap as far as communication within the wafer is concerned, at the cost of redefining the entire manufacturing and packaging chain.

This also explains Cerebras's engineering trade-offs. All architectural innovation is concentrated within the wafer, while scale-out directly leverages the 100GbE + RoCE Ethernet ecosystem. The 27 PB/s of intra-wafer bandwidth and the Tbps-level SwarmX interconnect across CS-3 nodes differ by several orders of magnitude, and that entire gap is handed over to commodity networking.

In inference scenarios, the bandwidth and latency advantages within a single wafer translate directly into token generation speed. OpenAI's choice to partner with Cerebras for inference makes sense from an architectural standpoint. Large-scale online inference requires low latency, high throughput, and deterministic latency, three properties that map directly onto the structural advantage of wafer-scale architecture: uniform on-chip communication.

However, this architecture also has several structural issues worth addressing. Yield and cost are unavoidable. Using an entire wafer as a single chip means that a defect in any reticle affects the whole. Cerebras relies on redundant PEs and routing bypasses to handle this, but the redundancy ratio and yield data have never been publicly disclosed. Manufacturing a whole wafer as one part inherently costs far more than dicing it and selling individual dies. Combined with a per-system power consumption of 23kW and a 15U footprint, deployment density and TCO will be tested by the economics of large-scale inference clusters.
The most critical issue is the KV cache capacity bottleneck. 42GB of on-chip SRAM seems large, but in long-context inference the KV cache grows linearly with sequence length. Taking Llama 70B as a reference, the KV cache for a single 128K-context sequence in FP16 takes about 40GB; even with KV cache quantization, capacity pressure in long-sequence scenarios remains significant. Whatever does not fit on-chip must live in external MemoryX storage and be fetched via SwarmX. That path offers bandwidth in the Tbps range, and the gap between it and the wafer's internal 27 PB/s means decode speed for long sequences will be bottlenecked by external bandwidth. This may be the most fundamental architectural constraint Cerebras faces in inference.
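The 40GB figure can be reproduced with a short calculation. The sketch below assumes "Llama 70B" means the publicly released Llama 2/3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the function names are hypothetical, and the 1 Tb/s link speed is only a placeholder for "the Tbps range", not a published SwarmX figure.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int, batch: int = 1) -> int:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


def stream_time_s(num_bytes: int, link_bits_per_s: float) -> float:
    """Time to move num_bytes once over a link of the given speed."""
    return num_bytes * 8 / link_bits_per_s


# Assumed model shape (Llama 2/3 70B-style grouped-query attention).
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
CONTEXT = 128 * 1024  # 128K tokens

kv_fp16 = kv_cache_bytes(CONTEXT, bytes_per_elem=2, **LLAMA_70B)
kv_int8 = kv_cache_bytes(CONTEXT, bytes_per_elem=1, **LLAMA_70B)
print(f"FP16 KV cache: {kv_fp16 / 2**30:.0f} GiB")
print(f"INT8 KV cache: {kv_int8 / 2**30:.0f} GiB")

# Placeholder external link speed; the text only says "Tbps range".
ASSUMED_LINK_BPS = 1e12
t = stream_time_s(kv_fp16, ASSUMED_LINK_BPS)
print(f"one pass over the FP16 cache at 1 Tb/s: {t:.3f} s")
```

Under these assumptions a single 128K sequence takes 40 GiB in FP16 and 20 GiB in INT8, so even a quantized cache occupies a large share of the on-chip SRAM before any weights are placed. Reading an off-chip cache of that size once per decode step over a Tbps-class link takes hundreds of milliseconds, which is the bottleneck described above.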