@ZhihuFrontier: GPU programming changed because Tensor Cores became too fast to feed Zhihu contributor THU-PACMAN实验室 shared a sharp bre…
Summary
A detailed analysis of how NVIDIA GPU programming evolved from Volta to Blackwell, highlighting the shift from synchronous thread models to asynchronous dataflow and the challenges of feeding Tensor Cores. The article discusses new hardware features like TMA, TMEM, and tcgen05 MMA, and shows how modern kernels like FlashAttention-3 and FlashMLA exploit these changes for higher utilization.
Cached at: 06/30/26, 09:40 AM
GPU programming changed because Tensor Cores became too fast to feed
Zhihu contributor THU-PACMAN Lab shared a sharp breakdown of how NVIDIA GPU programming moved from Volta to Blackwell. The headline is not “FP16 became FP4” or “TFLOPS went up.” The real shift is deeper: a CUDA kernel is no longer just a group of threads running in sync. It is becoming an asynchronous dataflow program across compute units, memory engines, barriers, buffers, and layouts.
The bottleneck moved to feeding Tensor Cores
From V100 to B100, peak FP16 throughput exploded:
V100: 112 TFLOPS, 0.90 TB/s HBM, ~124 FLOPS/Byte
A100: 312 TFLOPS, 2.04 TB/s HBM, ~153 FLOPS/Byte
H100: 990 TFLOPS, 3.35 TB/s HBM, ~296 FLOPS/Byte
B100: ~1800 TFLOPS, 8.00 TB/s HBM, ~225 FLOPS/Byte
On Hopper, even if HBM is fully saturated, each byte from global memory has to support roughly 300 floating-point operations. So the hard part is no longer “do we have matrix units?” It is how to keep them continuously fed.
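The FLOPS/Byte column above is just peak throughput divided by HBM bandwidth. As a rough sketch, the snippet below recomputes it from the quoted figures and compares it with the arithmetic intensity of a plain GEMM. The `gemm_intensity` helper and its assumption that each matrix touches HBM exactly once are illustrative simplifications, not part of the original analysis.

```python
# Peak FP16 Tensor Core throughput (TFLOPS) and HBM bandwidth (TB/s),
# copied from the figures quoted in the thread.
GPUS = {
    "V100": (112.0, 0.90),
    "A100": (312.0, 2.04),
    "H100": (990.0, 3.35),
    "B100": (1800.0, 8.00),
}


def flops_per_byte(tflops, tb_per_s):
    """FLOPs each HBM byte must feed to keep the Tensor Cores at peak."""
    return tflops / tb_per_s  # the 1e12 factors cancel


def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per HBM byte for C[m,n] = A[m,k] @ B[k,n].

    Assumption: A, B and C each cross HBM exactly once (ideal on-chip reuse).
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


for name, (tflops, bw) in GPUS.items():
    print(f"{name}: ~{flops_per_byte(tflops, bw):.0f} FLOPS/Byte")
```

Under these assumptions a large square FP16 GEMM clears the H100 ratio comfortably, while a matrix-vector product (m = 1) sits near 1 FLOP/Byte, which is why data movement rather than raw math tends to dominate.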
Data movement became a hardware pipeline
Volta still looked like classic CUDA: threads computed addresses, loaded from global memory, staged through registers, wrote into shared memory, then synchronized. Ampere added cp.async, letting data move from global memory to shared memory without going through registers. Hopper pushed the model further with TMA, where hardware handles tile-level address generation, strides, boundaries, and background movement. Blackwell adds TMEM, a dedicated tensor memory layer, and makes the kernel look even less like a normal thread program.
Matrix instructions also changed their meaning
Volta wmma: a warp synchronously computes together.
Ampere mma.sync: more control over shape and dtype, with explicit shared-memory layouts.
Hopper wgmma.mma_async: Tensor Cores become an asynchronous compute agent.
Blackwell tcgen05.mma: accumulation can live in TMEM, a new 256 KB per SM tensor storage layer.
The mental model changes from “all threads participate in this computation” to “launch work to a hardware unit, track its completion, and manage where the result lives.”
Synchronization is no longer just “wait for threads”
Traditional CUDA synchronization was mostly control-flow synchronization: __syncthreads() means everyone in the block has arrived. But with TMA, WGMMA, and Blackwell async MMA, the important question is different: who produced this data, who consumes it, which async proxy owns the operation, what signal proves the data is ready, and when the buffer can be reused. In modern kernels, synchronization is becoming fine-grained data dependency management.
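The producer-consumer questions above (what signal proves the data is ready, and when a buffer can be reused) can be modeled on the CPU. The sketch below is only an analogy in Python threads, not GPU code: a hypothetical `run_pipeline` uses one "full" and one "empty" semaphore per buffer stage, standing in for the per-stage barriers a real kernel would use. The stage count, tile contents, and the `sum` standing in for an MMA are all illustrative assumptions.

```python
import threading


def run_pipeline(num_tiles=8, stages=2):
    """Toy multi-stage pipeline: a copy thread feeds a compute thread."""
    buffers = [None] * stages
    # full[s]: "the tile in slot s has landed"; empty[s]: "slot s may be overwritten".
    full = [threading.Semaphore(0) for _ in range(stages)]
    empty = [threading.Semaphore(1) for _ in range(stages)]
    results = []

    def producer():
        for t in range(num_tiles):
            s = t % stages
            empty[s].acquire()        # wait until the consumer released the slot
            buffers[s] = [t] * 4      # stands in for an asynchronous tile copy
            full[s].release()         # signal that tile t is ready

    def consumer():
        for t in range(num_tiles):
            s = t % stages
            full[s].acquire()         # wait for the data, not for "all threads"
            results.append(sum(buffers[s]))  # stands in for the matrix multiply
            empty[s].release()        # hand the slot back for reuse

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results


print(run_pipeline())
```

Note that neither side ever waits for the other to reach a common point in the program; each waits only on the specific buffer it depends on, which is the difference from a block-wide __syncthreads().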
Low precision is not just changing dtype
FP8 and FP4 are often described as smaller types. But on Blackwell, low precision becomes a whole constraint system. For block-scaled MMA, the compiler must jointly reason about dtype, scale vector size, scale tensor layout, operand packing, alignment, physical layout, accumulator type, tile shape, dispatch policy, and which tcgen05 instruction is legal. So instruction selection is no longer a late backend peephole. It is tied to dtype, packing, scale placement, layout, accumulator choice, and epilogue design.
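To see why block scaling couples dtype to layout, here is a minimal pure-Python sketch of block-scaled 4-bit quantization. It is a simplification under stated assumptions: the block size of 32 and the plain floating-point scale are illustrative choices (hardware formats store the scale in a narrow type), and none of the tcgen05 packing or layout rules are modeled. The magnitude table is the E2M1 value set; the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Return (scale, codes) where codes are (is_negative, magnitude_index) pairs."""
    amax = max(abs(x) for x in block)
    # Assumption: a float scale that maps the block maximum onto 6.0.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        idx = min(range(len(FP4_E2M1)), key=lambda i: abs(FP4_E2M1[i] - mag))
        codes.append((x < 0, idx))
    return scale, codes


def quantize_row(row, block_size=32):
    """Quantize a row in blocks; the length must be a multiple of block_size."""
    if len(row) % block_size != 0:
        raise ValueError("row length must be a multiple of the scale block size")
    return [quantize_block(row[i:i + block_size])
            for i in range(0, len(row), block_size)]


def dequantize_row(blocks):
    """Rebuild approximate values from per-block scales and 4-bit codes."""
    out = []
    for scale, codes in blocks:
        out.extend((-1.0 if neg else 1.0) * FP4_E2M1[idx] * scale
                   for neg, idx in codes)
    return out
```

Even in this toy version, the scale block size already constrains the legal row length and adds a second tensor (the scales) that has to travel alongside the packed data.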
FlashAttention shows the new programming model clearly
FlashAttention-3 on H100 treats the GPU like a set of schedulable hardware units: one warpgroup runs WGMMA on Tensor Cores, another overlaps softmax on CUDA Cores, and TMA prefetches the next tile in the background. This “ping-pong scheduling” raised utilization sharply: FlashAttention-2 used about 35% of H100 peak, FA3 FP16 reached 740 TFLOPS, and FA3 FP8 approached 1.2 PFLOPS.
FlashMLA shows the same lesson from another angle. MLA decoding can require 32,768 32-bit registers for one 64×512 output tile, while an SM has only 65,536 registers. DeepSeek’s “seesaw scheduling” splits the output, alternates warpgroups, overlaps Tensor Core and CUDA Core work, and starts GEMM as soon as the first fine-grained TMA copy arrives.
Modern kernel optimization is no longer just choosing an algorithm. It is designing a schedule for hardware units.
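The FlashMLA register figure is simple arithmetic, sketched below. The 64×512 tile and the 65,536-register SM come from the text above; the 128-thread warpgroup and the 255-register per-thread cap are standard CUDA limits added here for context, and the helper names are hypothetical.

```python
REGS_PER_SM = 65_536          # 32-bit registers per SM (from the text above)
MAX_REGS_PER_THREAD = 255     # CUDA per-thread register cap (added for context)
THREADS_PER_WARPGROUP = 128   # 4 warps of 32 threads


def accumulator_registers(rows, cols):
    """One 32-bit register per FP32 accumulator element of the output tile."""
    return rows * cols


def regs_per_thread(rows, cols, warpgroups):
    """Accumulator registers each thread holds if the tile is split evenly."""
    return accumulator_registers(rows, cols) // (warpgroups * THREADS_PER_WARPGROUP)


tile = accumulator_registers(64, 512)
print(tile, tile / REGS_PER_SM)                  # tile size and share of the register file
print(regs_per_thread(64, 512, warpgroups=1))    # accumulators alone, one warpgroup
print(regs_per_thread(64, 512, warpgroups=2))    # after splitting the output in two
```

Held by a single warpgroup, the accumulators alone would exceed the per-thread cap before any other value is stored, which is consistent with the thread's point that the output has to be split across warpgroups.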
What this means for compilers
A compiler or DSL for modern AI chips needs to describe at least five things: tile shape, memory movement, physical layout, async producer-consumer dependencies, and the schedule of copy, compute, sync, and writeback. Without these, correctness is hard to verify and performance is hard to explain. This also means a unified frontend does not magically produce optimal backend code. Different hardware still needs specialized schedules, synchronization strategies, layouts, and resource models.
The biggest change from Volta to Blackwell is not a faster instruction. It is a new way to think about a GPU kernel: from synchronous thread cooperation to asynchronous dataflow across hardware components.
Full analysis: https://zhuanlan.zhihu.com/p/2054548076421978077…
#GPU #CUDA #AIInfra #Compiler #HPC #Blackwell #FlashAttention #Triton