@snowboat84: https://x.com/snowboat84/status/2061962883651731602


Summary

This article is the first part of the AI Engineering Panorama series. From a historical perspective, it reviews the evolution of GPUs from gaming graphics cards to AI accelerators, the bold bet of CUDA, the independent path of Google's TPU, and why NVIDIA ultimately prevailed. It also provides a detailed analysis of the underlying logic of AI infrastructure such as chips, supply chain, networking, and power.

https://t.co/fEDSELUq9P

Cached at: 06/03/26, 05:45 AM

The Engineering Panorama of AI: The Other Half Beyond the Model (Part 1)

Introduction

Recently, I spent a lot of time organizing and discussing AI models, particularly the Transformer behind large language models and the diffusion models behind image and video generation.

Beyond models, there’s another critically important half of AI: engineering.

On the model side, the usual topics are how parameters scale, what data is used for training, what scores are achieved on benchmarks, and which architecture wins. But for a model to actually run and be used, it needs more than the model itself. It requires a full engineering stack: chips, supply chains, power, networking, training clusters, inference engines, quantization, caching, scheduling. Each of these is a hard technology that has evolved largely independently over the past decade, and together they make AI usable today.

This article is an “Engineering Panorama” written for non-engineers. After reading, you’ll understand: what NVIDIA’s moat really is, why TPUs suddenly made a comeback in 2026, why TSMC’s CoWoS process determines how many AI accelerators the world can produce in a year, why vLLM is several times faster than naive PyTorch inference, why inference optimization is more cost-effective than swapping GPUs, and why “letting the model think a bit longer” is redefining the role of inference.

The article is divided into two parts. Part 1 covers infrastructure (chips, supply chain, networking, power) and runtime (training, inference). Part 2 will cover the modification phase (post-training, alignment) and the autonomous phase (Agent). This is Part 1.

I. Origins of the Hardware Ecosystem

To understand the AI computing landscape in 2026, you can’t start from the present. Today’s 80% NVIDIA market share, TSMC’s CoWoS process bottleneck, and the 2026 return of TPUs are all the compound results of a series of specific choices over the past 20 years. Starting from history makes the causal chain of this landscape clear.

1.1 Prehistory: GPUs Were Gaming Graphics Cards

Go back to the computing ecosystem of the 1990s. The CPU was the absolute protagonist, and Intel was the absolute hegemon. Intel’s x86 processors (386, 486, Pentium series) almost monopolized all PCs and workstations. The “Intel Inside” sticker, starting in 1991, covered computer cases worldwide. AMD was chasing but had a small share. What a computer could do was almost entirely determined by the CPU.

The GPU didn’t even have that name yet. NVIDIA officially coined the term “Graphics Processing Unit” when it released the GeForce 256 in 1999. Before that, the chip was called a “graphics card” or “3D accelerator,” an auxiliary piece of hardware plugged into the motherboard, dedicated to one thing: rendering game graphics. The wave of 3D games from 1996-2000 (Quake, Half-Life, Counter-Strike) pushed the graphics card market, with 3dfx, ATI, and NVIDIA competing for market share. NVIDIA was founded in 1993 but didn’t secure a top-tier position until the GeForce 256 in 1999.

CPU and GPU design philosophies were fundamentally opposite from the start. A CPU is “a few cores doing complex tasks.” A typical modern CPU has 8 to 32 high-performance cores, each capable of handling complex branch logic, OS scheduling, and database transactions. A GPU packs thousands of simple cores onto a single chip. Each core does very simple things (basically floating-point multiply-accumulate) but can process thousands of data points simultaneously. This “few but smart vs. many but dumb” division corresponds to two completely different types of computational needs.

Game graphics happen to be the perfect application scenario for this GPU architecture. Each frame requires projecting millions of triangles in a 3D scene onto a 2D screen and calculating lighting for each pixel. The computation for each pixel is independent of others and can be done in parallel. The thousands of cores in a GPU are designed for this.

Coincidentally, the core operation of neural networks (matrix multiplication) is also highly parallel. Multiplying a 1024x1024 matrix by another 1024x1024 matrix produces about 1 million output elements, each an independent 1024-term dot product, for roughly 1 billion multiply-accumulate operations in total. The parallel architecture of GPUs is mathematically a near-perfect match for this.
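To make the parallelism concrete, here is a minimal Python sketch. It is an illustration only, not how GPUs are actually programmed, and the function names are made up for this example. Each output element depends on just one row and one column, so in principle every element could be computed at the same time.

```python
def output_element(a, b, i, j):
    """Compute C[i][j] as a dot product of row i of A and column j of B."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))


def matmul(a, b):
    """Naive matrix multiply. No (i, j) call reads another call's result,
    which is what lets a GPU hand each one to a different core."""
    rows, cols = len(a), len(b[0])
    return [[output_element(a, b, i, j) for j in range(cols)] for i in range(rows)]


def work_counts(n):
    """For an n x n times n x n product: (independent outputs, total multiply-accumulates)."""
    return n * n, n * n * n


small = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
outputs, macs = work_counts(1024)
print(small)    # [[19, 22], [43, 50]]
print(outputs)  # 1048576 independent output elements
print(macs)     # 1073741824 multiply-accumulates in total
```

A CPU with a few dozen cores works through these million dot products a handful at a time; a GPU with thousands of cores can work on thousands of them at once.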

But in the 1990s and early 2000s, no one thought of this match. Neural networks were considered “outdated technology” in academia, overshadowed by SVM and boosting. GPU manufacturers themselves only focused on the gaming market and workstation 3D modeling, never reserving any place for AI in their products. CPU was for computing, GPU was for display – the two responsibilities were clearly divided.

1.2 2006: The Bold Gamble of CUDA

The turning point was 2006. NVIDIA released CUDA (Compute Unified Device Architecture), transforming the GPU from “dedicated hardware for rendering game graphics” into a “general-purpose parallel processor capable of running any parallel computation.”

CUDA is a programming interface that allows programmers to write code in C that can be compiled and run directly on the GPU. Before CUDA, using a GPU for computation required disguising the task as a rendering job using graphics APIs like OpenGL, which had a very high barrier to entry. CUDA lowered the barrier to almost the level of writing a normal C program.

Why create CUDA in 2006? It had nothing to do with AI. At that time, neural networks were still being shunned; AlexNet wouldn’t appear for another 6 years. The real motivation came from another line: the scientific computing community had been quietly using GPUs for general-purpose computing. In the early 2000s, physicists, chemists, and financial engineers discovered that the GPU’s parallel architecture was extremely well-suited to numerical simulations (fluid dynamics, molecular dynamics, Monte Carlo option pricing). The only way in was to use OpenGL to disguise mathematical problems as “rendering pixels,” for example by treating a matrix as a texture map and matrix multiplication as a shader operation on that texture. This was extremely difficult to code, but the results could be tens of times faster than on CPUs. In 2003-2004, Stanford PhD student Ian Buck created a programming model called Brook, which let GPUs do this kind of general-purpose computing more naturally. NVIDIA hired him, and Brook essentially became the precursor to CUDA. So CUDA was NVIDIA’s productization of the GPGPU (General-Purpose GPU computing) demand that had already sprouted in academia.

But at the time, others were doing similar things. AMD launched Stream SDK / Close to Metal (CTM) in 2007. Apple proposed the OpenCL standard in 2008, hoping to create a cross-vendor general-purpose computing API, later standardized by Khronos, supported by AMD/Intel/ARM. Microsoft added DirectCompute to DirectX 11 in 2009. But none of these caught up with CUDA. The reasons are several: CUDA is exclusive to NVIDIA, allowing deep co-design with hardware (later, AI-specific units like Tensor Core are only usable in CUDA). OpenCL’s cross-vendor compatibility came at the cost of single-card performance lagging behind CUDA. NVIDIA’s sustained investment density in documentation, tutorials, and libraries (cuBLAS, cuDNN, cuFFT) was unmatched by competitors for over a decade.

In 2006, this decision looked like a gamble. NVIDIA had to spend a significant amount of R&D investment annually to maintain the CUDA ecosystem, but at the time, there were almost no commercial applications that needed it. Gamers didn’t care if the GPU could do general-purpose computing; they only cared about frame rates. Investors and Wall Street repeatedly questioned this investment. During the 2008 financial crisis, NVIDIA’s stock price halved, but Jensen Huang still insisted on pouring R&D resources into CUDA.

This persistence lasted a full 10 years. From 2006 to 2016, CUDA primarily served high-performance computing in academia (scientific computing, fluid dynamics simulations, financial derivatives pricing). There were few commercial users, but the ecosystem was slowly accumulating: tutorials, libraries, open-source projects, PhD theses.

Why is this important? Because it established a fact: when AI later took off, all researchers found, “Ah, the parallel acceleration I need is already done by CUDA.” And NVIDIA’s competitors had to build a similarly mature development stack from scratch, a gap of ten years.

1.3 Google TPU’s Different Path

CUDA took the “general-purpose parallel” path. Google took a different path: “be extreme for a single purpose.”

In 2013, Google internally realized that deep learning was about to explode, and its own search, advertising, and recommendation systems would all need neural networks. But running these things on GPUs was too expensive and power-hungry. Google decided to build its own chip specifically designed for neural networks, called the TPU (Tensor Processing Unit).

The fundamental difference between a TPU and a GPU is “generality vs. specificity.” A GPU can run many types of parallel tasks, with neural networks being just one. A TPU can hardly do anything but the core operation of neural networks (matrix multiplication and addition), but it can excel at this one thing: the computing power per unit of electricity is several times higher than contemporary GPUs.

The first-generation TPU was deployed in Google’s internal data center in 2015. It was an internal product, not sold externally. Google’s logic was: TensorFlow is open source, but the fastest hardware to run TensorFlow remains for our own use. This is Vertical Integration.

This path was underestimated by outsiders for 10 years. It wasn’t until 2026 that people realized that Google’s decade of TPU investment had produced another hardware path, besides CUDA, capable of competing with NVIDIA. This is a key thread in Section II.

1.4 2012 Turning Point: AlexNet

2012 was the most critical year in AI history, and also the most critical year in GPU history.

That year, at the ImageNet computer vision competition, Geoffrey Hinton’s team from the University of Toronto (his two students Alex Krizhevsky and Ilya Sutskever) submitted a deep convolutional neural network called AlexNet. It reduced the ImageNet top-5 error rate from about 26% the previous year to 15.3%, beating the second-place entry by more than 10 percentage points.

This number itself was seismic in the ML community. But even more seismic was this: the hardware Krizhevsky used to train AlexNet was two consumer-grade NVIDIA GeForce GTX 580 gaming graphics cards.

The GeForce GTX 580 was a gaming card released by NVIDIA in 2010, priced around $500, targeting high-end gamers. Krizhevsky put two cards into his PC, wrote code using NVIDIA’s own CUDA, and far outperformed the top academic teams in image recognition. This event proved two things: neural networks really could work, and NVIDIA’s GPU + CUDA combination was the cheapest and fastest hardware stack for running neural networks.

The aftermath was a chain reaction. In 2013, Baidu, Google, and Facebook each established deep learning labs. Starting around 2015, the “training GPU” in almost all AI papers was an NVIDIA card. In 2020, GPT-3 was trained on NVIDIA V100s, and NVIDIA’s quarterly data center revenue surpassed its gaming revenue for the first time. In 2023, generative AI exploded, and NVIDIA’s stock price skyrocketed.

But the starting point for all these stories is one thing: a Canadian PhD student bought two gaming graphics cards and proved they could run AI.

1.5 Why NVIDIA Won, and Not Others

By 2026, NVIDIA holds about 80% of the AI GPU market. AMD has about 5-7%. The rest is Google TPU, AWS Trainium, Huawei Ascend, etc.

To understand how this number formed, look back at the competition history from 2012 to 2024. AlexNet in 2012 was just the beginning; NVIDIA winning the entire AI industry happened quietly over the next decade.

From 2012 to 2017, NVIDIA was quietly building its advantage. After AlexNet, AI researchers overwhelmingly adopted CUDA. NVIDIA released a succession of architectures (Kepler, Maxwell, Pascal), each strengthening AI-related computing power. In 2014, it released cuDNN (CUDA Deep Neural Network library), which heavily optimized common deep learning operators such as convolution, with attention added in later versions. Every AI framework (Caffe, Theano, Torch, TensorFlow, PyTorch) first ran on CUDA before considering other hardware.

During the same period, Google was not idle. TPU v1 went online internally in 2015, followed by v2 in 2017 and v3 in 2018. In terms of technical indicators, TPU’s computing power per watt was a notch above contemporary NVIDIA GPUs (because it was an ASIC specifically designed for neural networks). But the TPU had a fatal limitation: it could only be rented on Google Cloud, not purchased. This locked out the vast majority of researchers. Even if the TPU was technically more advanced, most of industry and academia could not practically use it.

From 2017 to 2020, NVIDIA solidified its victory. The V100 brought Tensor Core, the first GPU hardware unit specifically designed for matrix multiplication. This step transformed the “general GPU” into an “AI-specific GPU,” boosting single-card training performance by a tier. GPT-3 was trained on V100s.

During the same period, PyTorch (released by Facebook in 2016) gradually replaced TensorFlow as the de facto framework for AI research. PyTorch natively supports CUDA, and its TPU support is far less mature than TensorFlow’s. Two things overlapped: TensorFlow, to which the TPU was tied, lost momentum, and Tensor Cores let NVIDIA’s hardware performance catch up with the TPU. Google’s hardware advantage was neutralized.

From 2020 to 2023, the monopoly took shape. The A100 (2020) and H100 (2022) were released successively, with each generation significantly upgrading memory, bandwidth, and Tensor Cores. When ChatGPT exploded in late 2022, the entire industry suddenly needed a massive number of GPUs to train large models. NVIDIA was in short supply, and gross margins rose from 60% to over 70%. AMD’s MI series tried to catch up, but the CUDA ecosystem gap was too large; AI companies preferred to wait in line for H100s rather than switch to AMD. Google’s TPU was still only for rent, with high external usage barriers. During this period, NVIDIA went from “preferred choice” to “de facto monopoly.”

Why ultimately NVIDIA? Many people think it’s the advantage of the chip itself. This understanding is only half right. The real moat is the software ecosystem.

NVIDIA’s moat has three layers.

First, the CUDA software stack. All mainstream AI frameworks (PyTorch, TensorFlow, JAX) first run on CUDA before supporting other hardware. The open-source code for all top conference papers defaults to CUDA. All AI engineers in large companies are accustomed to CUDA. Switching hardware means rewriting code, retuning, and dealing with bugs. This migration cost is extremely high for enterprises.

Second, Tensor Core hardware acceleration. Starting with the Volta architecture in 2017, NVIDIA added specialized matrix multiply units (Tensor Cores) to each generation of GPU, running neural networks several times faster than general floating-point units. This is the advantage at the hardware level.

Third, NVLink high-speed interconnect. For multi-card training, fast data exchange between cards is essential. NVIDIA’s proprietary NVLink is roughly an order of magnitude faster than the general-purpose PCIe bus. Later, the GB200 NVL72 connected 72 GPUs using NVLink into a “quasi-single GPU,” with per-GPU NVLink bandwidth of 1.8 TB/s. This is the advantage at the system level.
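To get a feel for the gap, the following Python sketch compares how long one GPU would take to move a fixed amount of data over each link. The 1.8 TB/s figure is the NVLink number quoted above. The PCIe figure (128 GB/s, roughly a PCIe 5.0 x16 slot counted in both directions) and the 140 GB payload are assumptions chosen for illustration, and real transfers do not reach peak bandwidth.

```python
def transfer_seconds(payload_gb, bandwidth_gb_per_s):
    """Idealized time to move a payload at peak link bandwidth (no protocol overhead)."""
    return payload_gb / bandwidth_gb_per_s


NVLINK_GB_PER_S = 1800.0  # 1.8 TB/s, as quoted in the text
PCIE_GB_PER_S = 128.0     # assumption: roughly PCIe 5.0 x16, both directions combined
PAYLOAD_GB = 140.0        # assumption: e.g. a 70B-parameter model in 16-bit precision

nvlink_time = transfer_seconds(PAYLOAD_GB, NVLINK_GB_PER_S)
pcie_time = transfer_seconds(PAYLOAD_GB, PCIE_GB_PER_S)
speedup = pcie_time / nvlink_time

print(f"NVLink: {nvlink_time:.3f} s, PCIe: {pcie_time:.3f} s, ratio: {speedup:.1f}x")
```

Under these assumed figures the ratio comes out at about 14x. In multi-card training this exchange happens at every step, so the difference compounds over millions of steps.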

The three layers combined form a de facto monopoly. NVIDIA’s FY2026 data center revenue was $193.7 billion, with gross margins above 70%. This is a rare case in tech history where a company simultaneously holds the high ground of both market share and profit margin.

But this landscape began to loosen in 2025-2026. The reason is inference. The next chapter explains.

II. The Landscape Shifts

The biggest change at the hardware layer in 2026 is that, for the first time, a meaningful alternative to NVIDIA has emerged. But it looks different from what many imagine. No new company has produced a chip that surpasses NVIDIA’s. Instead, several top AI companies have begun systematically using second and third hardware platforms to share the load.

First, look at 2026 AI accelerator market share (estimated by revenue):

[Table not preserved in the cached text.]

When broken down by workload, the picture becomes clearer: NVIDIA accounts for >90% in training, but drops to 60-75% in inference, with the remaining 25-40% going to ASICs and AMD. This breakdown is the specific source of why inference is loosening the monopoly, discussed in Section 2.4.

Two internal usage rate data points help show the scale: Google runs >75% of Gemini on TPUs, and AWS Trainium handles >50% of Bedrock token throughput. More than half of the core workloads on these two platforms have already migrated to their own ASICs. The old narrative that “there is no real production environment outside of NVIDIA” no longer holds.

2.1 The Scaled Return of TPU

In October 2025, Anthropic announced a blockbuster deal: it signed the largest TPU order ever with Google. The specifics: using up to 1 million Ironwood (TPU v7) chips in 2026, with over 1 GW of computing capacity. The total value of this deal is in the tens of billions of dollars.

In April 2026, this deal was expanded again. Anthropic, Google, and Broadcom signed a multi-year agreement adding 3.5 GW of next-generation TPU capacity in 2027. Anthropic’s annualized revenue jumped from about $9 billion at the end of 2025 to over $30 billion in April 2026. AI lab capital expenditure has reached tens of billions of dollars on a single order.

The key isn’t just the size of the order. The key is that both Gemini 3 and Claude Opus 4.5 (widely considered the two strongest frontier models of 2026) were trained on TPUs. The significance: TPU has passed the toughest test, “pre-training a frontier model.” For the past 10 years, outsiders had wondered whether TPUs were only good for inference and could not compete with NVIDIA GPUs in training. In 2026, these two models settled the question in the TPU’s favor.

How tight is the actual demand? Starting mid-2026, an unusual situation emerged: Google’s own researchers had to wait in line for TPUs, behind external large customers like Anthropic. This state of “selling so much that there’s not enough for internal use” is unprecedented in TPU’s 10-year history.

Another thing: Starting April 2026, Google is selling TPUs externally for the first time in a decade. Previously, TPU could only be rented through Google Cloud, not purchased. Opening sales means Google believes its TPU ecosystem is mature enough to compete head-on with NVIDIA. Revisiting the point in Section 1.5 – “the key reason TPU didn’t win before was that it was only for rent, not for sale” – this step exactly fixes that flaw.

2.2 The Third Pole: AWS Trainium and the Custom Chip Wave

TPU is not the only force challenging NVIDIA. AWS’s custom Trainium (training) and Inferentia (inference) chips form another pole, expanding even faster than TPU in 2026.

In April 2026, Anthropic signed a multi-year agreement with AWS: 5 GW of Trainium capacity + AWS’s $10 billion committed investment. Trainium2 went live at scale in the first half of 2026, with Trainium3 following in the second half. Of all tokens running on AWS Bedrock, >50% are already on Trainium, not on NVIDIA GPUs. This is an internal migration invisible to users of AWS’s platform, but it means Trainium has passed the large-scale production validation in a real environment.

Besides Anthropic as the largest anchor customer, Apple also started using Trainium in 2026 for its Apple Intelligence inference workloads. OpenAI is also using it: a TechCrunch report from March 2026 confirmed that OpenAI is renting Trainium capacity on AWS for some inference. OpenAI runs on both Azure (Maia) and AWS (Trainium), consistent with the “multi-platform coexistence” pattern discussed in Section 2.3.

As for AWS Trainium’s outlook over the next 2-3 years, its market share is likely to rise from the current 3-5% to 8-12%, for several reasons. Anthropic’s 5 GW + $100B lock-up gives Trainium an anchor customer of absolute magnitude, ensuring stable demand for development and iteration. Trainium is releasing two generations in 2026, keeping pace with frontier training. AWS sales have been open from day 1, without the historical baggage of TPU’s “only for rent,” making it easier to get started. AWS itself is the largest cloud platform, with the channel advantage of recommending Trainium to customers by default.

Risks are also clear. Trainium’s single-card performance is a notch weaker than TPU (more conservative design, prioritizing cost), so high-end training market share may be eaten by TPU. Also, the high dependency on a single customer, Anthropic, is risky; if Anthropic has problems or switches platforms, the impact would be significant.

Other ASIC players are also moving. Microsoft’s Maia is mainly for OpenAI inference on Azure. OpenAI’s own custom chip in collaboration with Broadcom is expected to ship in 2026-2027. Meta’s MTIA series has been running on Facebook’s internal recommendation and ad systems for years. Essentially, every major company that spends tens of billions annually on GPU procurement is developing its own accelerator.

The reason for this wave of custom chips is not complicated. NVIDIA’s 70%+ gross margin means that for every H200 or B200 a customer buys, NVIDIA takes the lion’s share. For large companies spending tens of billions on GPUs annually, there is a strong incentive to keep that margin in-house. Even if the custom chip’s performance is only 60-70% of NVIDIA’s, as long as the cost-performance ratio is good, the overall equation works.
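A back-of-envelope Python sketch of this argument follows. Only the 70% gross margin and the 60-70% relative performance come from the text. The unit costs are hypothetical placeholders, and the function names are made up for this illustration. The sketch also ignores software migration cost, which earlier sections argue is substantial.

```python
def price_from_margin(unit_cost, gross_margin):
    """Selling price implied by a unit cost and a gross margin (margin = 1 - cost / price)."""
    return unit_cost / (1.0 - gross_margin)


def perf_per_dollar(relative_perf, cost_to_buyer):
    """Performance obtained per dollar spent by the buyer."""
    return relative_perf / cost_to_buyer


# Hypothetical inputs for illustration only.
VENDOR_UNIT_COST = 10_000.0   # assumed manufacturing cost of a merchant GPU, in dollars
VENDOR_GROSS_MARGIN = 0.70    # the 70%+ margin cited in the text
CUSTOM_UNIT_COST = 15_000.0   # assumed in-house cost per chip, including amortized design
CUSTOM_RELATIVE_PERF = 0.65   # middle of the 60-70% range in the text

gpu_price = price_from_margin(VENDOR_UNIT_COST, VENDOR_GROSS_MARGIN)
gpu_value = perf_per_dollar(1.0, gpu_price)
custom_value = perf_per_dollar(CUSTOM_RELATIVE_PERF, CUSTOM_UNIT_COST)
advantage = custom_value / gpu_value

print(f"Implied GPU price: ${gpu_price:,.0f}")
print(f"Custom chip performance per dollar vs GPU: {advantage:.2f}x")
```

With these placeholder numbers, the slower custom chip still delivers about 1.4x the performance per dollar, because the buyer no longer pays the vendor’s margin. The conclusion flips if the in-house cost per chip climbs high enough, which is why this only works at very large purchase volumes.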

2.3 Key Judgment: Select Chips by Workload, Multi-Platform Coexistence

It’s important to avoid a common misunderstanding: the rise of custom chips does not mean NVIDIA is collapsing. A more accurate picture is: multi-platform coexistence, precise matching of workload to platform.

Anthropic is a specimen of this paradigm. It uses three hardware sets simultaneously:

  • Google TPU (Ironwood): for large-scale training and inference
  • AWS Trainium: for a portion of training loads
  • NVIDIA GPUs (B200, GB200): for inference services and some training

Why three sets? Because different workloads have different cost/performance characteristics on different hardware. Anthropic does not depend on any single supplier, placing each load on the most suitable chip.

This model is spreading in 2026. OpenAI, Meta, and Google DeepMind are all doing similar multi-platform architectures. The result: NVIDIA’s share may gradually decline from 80%, but its absolute revenue is still growing (because the entire AI computing market is expanding much faster than any single alternative).

2.4 The Lever Loosening the Monopoly: Why Inference

Why did this wave of substitution only appear in 2025-2026? Because the lever for loosening the monopoly changed – from training to inference.

In the training phase, the key is “can it run?” Pre-training a frontier large model requires tens of thousands of top-tier GPUs running for months. The key factors are peak computing power, memory bandwidth, and inter-card communication speed. NVIDIA has long led in all three. Customers will pay whatever it takes for H100/B200 because, for jobs at this scale, other cards have struggled to run them at all.

In the inference phase, the key is “cost per token.” The model is already trained; the task is to answer user questions 365 days a year. A large inference service processes tens of thousands of tokens per second. The key is the electricity + hardware depreciation cost per token. On this curve, the cost-effectiveness advantage of specialized ASICs (TPU, Trainium) becomes prominent. TPU v6e has about 4x the cost-effectiveness compared to H100.
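The “electricity + hardware depreciation per token” idea can be written out as a small Python sketch. Every number below is a hypothetical placeholder picked to show the mechanics, not a measurement of any real chip, and the function name is made up for this illustration.

```python
def cost_per_million_tokens(chip_price, lifetime_years, power_kw, price_per_kwh, tokens_per_second):
    """Hourly depreciation plus hourly electricity, divided by hourly token output."""
    hours_per_year = 24 * 365
    depreciation_per_hour = chip_price / (lifetime_years * hours_per_year)
    electricity_per_hour = power_kw * price_per_kwh
    tokens_per_hour = tokens_per_second * 3600
    return (depreciation_per_hour + electricity_per_hour) / tokens_per_hour * 1_000_000


# Hypothetical general-purpose GPU: more expensive and more power, higher throughput.
gpu_cost = cost_per_million_tokens(
    chip_price=30_000, lifetime_years=4, power_kw=1.0, price_per_kwh=0.08, tokens_per_second=2000
)
# Hypothetical inference ASIC: cheaper and lower power, somewhat lower throughput.
asic_cost = cost_per_million_tokens(
    chip_price=12_000, lifetime_years=4, power_kw=0.6, price_per_kwh=0.08, tokens_per_second=1500
)

print(f"GPU:  ${gpu_cost:.3f} per million tokens")
print(f"ASIC: ${asic_cost:.3f} per million tokens")
```

In this made-up case the ASIC is slower per chip yet cheaper per token. That is the shape of the argument: once the question is cost per token instead of peak speed, a chip does not need to be the fastest to win the workload.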

In one sentence: The economics of inference are loosening the monopoly that NVIDIA built on training.

The extension of this logic: as inference surpasses training in total computing consumption (about 2/3 vs. 1/3) in 2026, the hardware spending focus of the entire AI industry shifts from “performance-first” to “cost-effectiveness first.” This is the fundamental reason NVIDIA’s share is starting to be eroded.

2.5 The China Line: Another Ecosystem Forced by Sanctions

Add a parallel line: China’s AI hardware ecosystem.

U.S. export controls on AI chips for China began in 2022, tightening progressively through 2023, 2024, and 2025. At the strictest point, even crippled versions like the H800 and H20 could not be sold to China.

The intended purpose of these controls was to slow China’s AI development. The actual effect was to accelerate domestic substitution. Huawei’s Ascend series chips held about a 62% share of the domestic AI accelerator market in 2026. Cambricon, Hygon, and Enflame each have a portion. Chinese internet giants (Alibaba, ByteDance, Tencent) are all developing their own accelerators.

DeepSeek is a landmark company on this path, but a careful look at its hardware reveals that “domestic substitution” is far from complete as reported externally.

DeepSeek’s Hardware Reality (based on public data):

[Table not preserved in the cached text.]

A more subtle detail: DeepSeek used PTX (NVIDIA’s proprietary assembly-level instruction set, lower level than CUDA) for extreme optimization during training on NVIDIA. Technically, this is still within the NVIDIA ecosystem, just bypassing the CUDA API layer. DeepSeek spent months porting code to Huawei’s CANN framework (a domestic alternative to CUDA), but in practice, it can only stably run inference; training often crashes.

This picture is different from what people imagine. From 2024-2026, Chinese AI companies actually run on a two-stack system: “training relies on smuggled NVIDIA + inference relies on Huawei.” Huawei’s Ascend 910C running DeepSeek inference achieves about 60% the performance of an H100; this end is truly feasible. But on the training end, the moat of the CUDA ecosystem is as deep in China as it is in the US.

The significance of the China line therefore needs to be re-evaluated. An accurate description is: China has found a feasible fallback on the inference side (Huawei Ascend), but the training side is still constrained by CUDA, and domestic substitution is far from complete. The statement that “China has built a complete NVIDIA replacement system” was still not true in 2026. This is entirely the Chinese version of the same story from Section 1.5: “hardware performance can catch up, but software ecosystem is hard to match.”

The existence of this situation still weakens the “NVIDIA is completely irreplaceable” narrative, at least providing a second source on the inference side. But it does not weaken the “CUDA lock-in” moat on the training side; instead, it validates it with a new case.

III. Supply Chain and Networking

Having covered the chips themselves, the next layer is the supply chain. The chip is just the end point; what really determines how much computing power can be created are three upstream chokepoints: advanced process nodes, HBM memory, and CoWoS packaging. A shortage in any one of these constrains the expansion rate of the entire AI industry.

After the supply chain, we also need to cover the “connection between chips,” i.e., networking. In a 10,000-GPU cluster, the communication bottleneck between cards and nodes is harder to solve than single-card computing power. This section is a part of the foundational layer that is severely underestimated by the outside world.

3.1 The Three Chokepoints: Wafers, HBM, CoWoS

Advanced Process Logic Wafers (2nm/3nm). All top-tier AI chips use TSMC’s most advanced processes. NVIDIA, Google, AMD, and Apple are all fabless: they design their own chips but outsource manufacturing entirely to foundries like TSMC. So NVIDIA’s B200 and Google’s TPU physically come out of the same company’s fabs. In 2026, TSMC’s 2nm capacity is fully booked into 2027 and beyond. TSMC CEO C.C. Wei has publicly stated that supply won’t catch up with demand until 2027.

HBM High Bandwidth Memory. When AI accelerators run training and inference, the bottleneck is often memory bandwidth, while computing power is sufficient. The chip constantly reads weights and data from memory. Standard DDR memory bandwidth is insufficient, requiring HBM. HBM is high-bandwidth memory made by vertically stacking multiple DRAM dies using TSV technology (Through-Silicon Vias). Only three companies can make it: SK hynix, Samsung, and Micron.
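A rough way to see why bandwidth, and not compute, often sets the ceiling: when generating text one token at a time for a single request, the chip has to read essentially all model weights from memory for every token. The Python sketch below turns that into an upper bound. The model size and both bandwidth figures are illustrative assumptions, not specifications of any particular product, and batching or caching changes the picture in practice.

```python
def max_tokens_per_second(param_count, bytes_per_param, bandwidth_gb_per_s):
    """Upper bound on single-request generation speed if every token
    requires reading all weights from memory once."""
    model_gb = param_count * bytes_per_param / 1e9
    return bandwidth_gb_per_s / model_gb


PARAMS = 70e9            # assumption: a 70B-parameter model
BYTES_PER_PARAM = 2      # assumption: 16-bit weights
HBM_GB_PER_S = 3000.0    # assumption: HBM-class bandwidth
DDR_GB_PER_S = 100.0     # assumption: standard DDR-class bandwidth

hbm_bound = max_tokens_per_second(PARAMS, BYTES_PER_PARAM, HBM_GB_PER_S)
ddr_bound = max_tokens_per_second(PARAMS, BYTES_PER_PARAM, DDR_GB_PER_S)

print(f"HBM-class memory: at most {hbm_bound:.1f} tokens/s")
print(f"DDR-class memory: at most {ddr_bound:.2f} tokens/s")
```

Under these assumptions the same chip would go from about 21 tokens per second to under 1 token per second purely because of the memory behind it, no matter how much compute it has.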

How tight is the global HBM supply in 2026? SK hynix directly stated: all of its HBM production capacity for the entire year of 2026 has already been fully booked. SK hynix accounts for about 50% of global HBM capacity. HBM4 (next generation) starts sampling in 2026, with mass production expected from late 2026 into 2027.

CoWoS Packaging (Chip on Wafer on Substrate). This is TSMC’s advanced packaging technology, stacking logic chips and HBM onto the same substrate. Almost all high-end AI accelerators rely on CoWoS: NVIDIA’s H100/B200, Google’s TPU, AWS’s Trainium, AMD’s MI300, Huawei’s Ascend – all use CoWoS or similar advanced packaging.

TSMC’s CoWoS capacity at the end of 2024 was about 35,000 wafers per month, with a target to expand to 130,000 wafers per month by the end of 2026 (a 271% increase). But NVIDIA has locked up half of this capacity (about 65,000 wafers per month), leaving the rest for all other customers to split. The CoWoS lead time has already extended to over 50 weeks.
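The figures in this paragraph can be checked with a few lines of Python. All inputs come from the text; the variable names are just for this sketch.

```python
START_WAFERS_PER_MONTH = 35_000    # end of 2024, from the text
TARGET_WAFERS_PER_MONTH = 130_000  # end-2026 target, from the text
NVIDIA_SHARE = 0.5                 # "half of this capacity", from the text

increase_pct = (TARGET_WAFERS_PER_MONTH - START_WAFERS_PER_MONTH) / START_WAFERS_PER_MONTH * 100
growth_multiple = TARGET_WAFERS_PER_MONTH / START_WAFERS_PER_MONTH
nvidia_wafers = TARGET_WAFERS_PER_MONTH * NVIDIA_SHARE
others_wafers = TARGET_WAFERS_PER_MONTH - nvidia_wafers

print(f"Increase: {increase_pct:.0f}% ({growth_multiple:.2f}x)")
print(f"NVIDIA: {nvidia_wafers:,.0f} wafers/month, everyone else: {others_wafers:,.0f}")
```

Note that NVIDIA’s reserved half of the 2026 target is nearly twice the entire industry’s CoWoS capacity at the end of 2024.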

3.2 Structural Shortage: Lasting into 2027

First, define “shortage” precisely. Historically, shortages in the semiconductor industry have been cyclical: a certain product suddenly becomes a hit (phones, cars), demand surges, chip manufacturers expand capacity over 18-24 months, the market becomes oversupplied, prices fall, and the next cycle begins. This was the standard rhythm for the past 30 years; the 2020-2023 automotive chip shortage was an example.

The current AI shortage is not this kind of rhythm. The slope of the demand curve is structurally higher than the capacity curve. Specific numbers:

  • NVIDIA’s data center revenue grew from $47.5 billion in FY2024 to $193.7 billion in FY2026, a roughly 4x increase in two years.
  • Over the same period, top-tier chip capacity (TSMC 2nm/3nm + CoWoS + HBM) grew roughly 2-3x.
  • The gap between these two speeds is structural and cannot be closed in the short term.
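To compare the two curves on the same footing, the Python sketch below converts each multiple into an annualized growth rate. It assumes FY2024 to FY2026 counts as two years of growth and applies the same span to the 2-3x capacity range; both are simplifications made for this illustration.

```python
def annual_growth_rate(multiple, years):
    """Compound annual growth rate implied by growing `multiple`-fold over `years` years."""
    return multiple ** (1.0 / years) - 1.0


YEARS = 2  # assumption: FY2024 -> FY2026 treated as two years of growth

demand_multiple = 193.7 / 47.5  # data center revenue figures from the text
demand_rate = annual_growth_rate(demand_multiple, YEARS)
capacity_rate_low = annual_growth_rate(2.0, YEARS)   # low end of the 2-3x range
capacity_rate_high = annual_growth_rate(3.0, YEARS)  # high end of the 2-3x range

print(f"Demand:   about {demand_rate:.0%} per year")
print(f"Capacity: about {capacity_rate_low:.0%} to {capacity_rate_high:.0%} per year")
```

On these assumptions demand roughly doubles every year while capacity grows by roughly 40-75% a year, so the gap widens each year instead of closing.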

Why is it structural? Three reasons.

First, fab construction cycles are long. Building a new 2nm fab from groundbreaking to volume production takes 3-5 years. TSMC’s Arizona 2nm fab broke ground in 2024 and won’t reach volume production until 2028. Demand doubles in two years or less, while capacity takes about five years to double: the rhythms are completely mismatched.

Second, HBM is not simple capacity expansion. Moving from HBM3 to HBM4 involves major process changes: new TSV technology, new stacking layer counts, new packaging methods. SK hynix, Samsung, and Micron are all racing but cannot keep up with the growth in demand. HBM4 volume production has been pushed from 2025 to late 2026 and then to 2027.

Third, CoWoS packaging is exclusive to TSMC. No second company globally can produce an equivalent level of 2.5D advanced packaging at scale. Expansion is limited by the capital spending pace of a single company, with no possibility of competitive expansion.

Why is 2027 the inflection point? Three things converge that year. First, several of TSMC’s new fabs (Arizona, Kumamoto, Germany) will start volume production. Second, HBM4 will enter large-scale shipment. Third, CoWoS monthly capacity will have expanded from 35,000 wafers at end-2024 to the target of 130,000 wafers. TSMC CEO C.C. Wei stated plainly on the earnings call in early 2026: “Supply will not catch up with demand until 2027.”

Practical implication for the AI industry: In 2026, any company wanting to buy top-tier AI accelerators will have to wait in line. OpenAI, Anthropic, Meta, Google’s own TPUs, xAI – all are competing for the same pool of CoWoS capacity. This is why the phenomenon described in Section 1.5 – “customers prefer to queue for H100s rather than switch to AMD” – looks like brand preference but is actually dictated by supply-side reality. AMD would like to sell but cannot: TSMC doesn’t have enough CoWoS capacity to allocate to it.

3.3 Geographic Concentration: A Physical Chokepoint

Mapping the capacity of the three chokepoints above reveals a troubling fact: the world’s critical AI chip capacity is extremely concentrated in two places, Taiwan and South Korea.

Almost all advanced logic wafers are made in Taiwan. TSMC’s 2nm/3nm lines are in Tainan and Hsinchu. Samsung also has 3nm in Hwaseong, South Korea, but its process lags behind TSMC’s. Intel 18A in the US is not yet in volume production. TSMC’s Arizona factory started small-scale 4nm production at the end of 2025, but 2nm there is still years away.

HBM is concentrated in South Korea (SK hynix, Samsung) and the US (Micron).

CoWoS is almost entirely with TSMC.

This means: the expansion of AI computing power worldwide physically depends on a few companies in Taiwan and South Korea continuing to expand production as planned. If any link in this system falters (earthquake, geopolitical event, supply chain disruption), the entire AI industry’s capacity expansion will be impacted.

This system was already directly reshaped by geopolitics in 2024-2025. US export controls on China led NVIDIA to shift production capacity intended for China-bound H200s to the next-generation Vera Rubin, for which it has confirmed orders from US customers. TSMC’s Arizona factory expansion is being driven by the US government through the CHIPS Act. The political risk around Taiwan itself is repeatedly factored into the capital planning of every AI company.

3.4 Between Nodes: Networking is the True Bottleneck for Distributed Training/Inference

Having covered the chip layer and supply chain, we need to address an engineering layer severely underestimated by the outside world: networking.

In a cluster with tens of thousands of GPUs, it does not matter how powerful a single card is: if cross-card communication is slow, the entire cluster stalls. In the large-model era, this is as critical as raw compute.

Networking is divided into two layers.

Intra-node (scale-up, high-bandwidth short distance): Interconnection of GPUs within a single server or single rack. NVIDIA’s NVLink and NVSwitch dominate this layer. The GB200 NVL72 uses NVLink to connect 72 GPUs into a “quasi-single GPU,” with about 1.8 TB/s of NVLink bandwidth per GPU. At this bandwidth, the 72 cards can work like a single super GPU with a huge pooled memory.

Inter-node (scale-out, across data center racks): Connecting thousands of servers (each with 8 cards) into a 10,000-GPU cluster. This layer has a competition between two routes: InfiniBand (IB) vs. Ethernet.

InfiniBand is NVIDIA’s home turf. NVIDIA agreed to acquire Mellanox in 2019 (the deal closed in 2020), gaining control of IB. IB has extremely low latency (microseconds) and is the traditional choice for distributed training. Meta’s Llama 3 training cluster and xAI’s Memphis Colossus cluster both use IB.

Ethernet is the cloud providers’ bet. AWS, Google, and Microsoft are each investing heavily in high-speed Ethernet, for two reasons. First, IB is controlled entirely by NVIDIA, so Ethernet is the open-ecosystem choice. Second, Ethernet is universal in cloud data centers, so existing operations tooling can be reused. The latest 800G Ethernet with RDMA is approaching IB in bandwidth, and the latency gap is also narrowing. The outlook for 2026: the open-source LLM training community is tilting towards Ethernet, while closed-source large companies still use IB.
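A rough way to see why link bandwidth matters so much is the textbook cost model for ring all-reduce, a collective that sums gradients across all GPUs during training. In a ring of N GPUs, each GPU sends about 2(N-1)/N times the payload. The sketch below uses only that bandwidth term and ignores latency, congestion, and overlap with compute. The payload size and GPU counts are hypothetical, and the 1.8 TB/s NVLink figure is treated as an upper bound on usable bandwidth.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time.

    Each GPU transfers 2 * (N - 1) / N of the payload
    (a reduce-scatter phase followed by an all-gather phase).
    """
    if num_gpus < 2:
        return 0.0
    traffic_factor = 2.0 * (num_gpus - 1) / num_gpus
    return traffic_factor * payload_bytes / link_bytes_per_s


NVLINK_BYTES_PER_S = 1.8e12            # 1.8 TB/s per GPU, figure from the text
ETHERNET_800G_BYTES_PER_S = 800e9 / 8  # 800 Gb/s link = 100 GB/s

# Hypothetical payload: gradients of a 70B-parameter model at 2 bytes each.
GRADIENT_BYTES = 70e9 * 2

# Hypothetical cluster sizes: one NVL72 rack vs. 1,024 GPUs over Ethernet.
scale_up_s = ring_allreduce_seconds(GRADIENT_BYTES, 72, NVLINK_BYTES_PER_S)
scale_out_s = ring_allreduce_seconds(GRADIENT_BYTES, 1024,
                                     ETHERNET_800G_BYTES_PER_S)

print(f"Scale-up (72 GPUs, NVLink): {scale_up_s:.2f} s per all-reduce")
print(f"Scale-out (1024 GPUs, 800G): {scale_out_s:.2f} s per all-reduce")
```

Real training stacks narrow this gap with hierarchical collectives, gradient sharding, and overlapping communication with compute, so these numbers are only an order-of-magnitude illustration.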

Communication bottleneck: all-reduce. In 10,000
