@ZhihuFrontier: GPU programming changed because Tensor Cores became too fast to feed Zhihu contributor THU-PACMAN实验室 shared a sharp bre…

X AI KOLs Timeline 06/30/26, 08:20 AM News

gpu-programming nvidia tensor-cores cuda blackwell hopper volta data-movement low-precision

Summary

A detailed analysis of how NVIDIA GPU programming evolved from Volta to Blackwell, highlighting the shift from synchronous thread models to asynchronous dataflow and the challenges of feeding Tensor Cores. The article discusses new hardware features like TMA, TMEM, and tcgen05 MMA, and shows how modern kernels like FlashAttention-3 and FlashMLA exploit these changes for higher utilization.

GPU programming changed because Tensor Cores became too fast to feed

Zhihu contributor THU-PACMAN实验室 (THU-PACMAN Lab) shared a sharp breakdown of how NVIDIA GPU programming moved from Volta to Blackwell. The headline is not “FP16 became FP4” or “TFLOPS went up.” The real shift is deeper: a CUDA kernel is no longer just a group of threads running in sync. It is becoming an asynchronous dataflow program across compute units, memory engines, barriers, buffers, and layouts.

Cached at: 06/30/26, 09:40 AM

The bottleneck moved to feeding Tensor Cores

From V100 to B100, peak FP16 throughput exploded:

V100: 112 TFLOPS, 0.90 TB/s HBM, ~124 FLOPS/Byte
A100: 312 TFLOPS, 2.04 TB/s HBM, ~153 FLOPS/Byte
H100: 990 TFLOPS, 3.35 TB/s HBM, ~296 FLOPS/Byte
B100: ~1800 TFLOPS, 8.00 TB/s HBM, ~225 FLOPS/Byte

On Hopper, even if HBM is fully saturated, each byte from global memory has to support roughly 300 floating-point operations. So the hard part is no longer “do we have matrix units?” It is how to keep them continuously fed.
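The FLOPS/Byte figures above are peak throughput divided by HBM bandwidth. A minimal Python sketch reproduces them from the numbers quoted in the text; the `GPUS` table and the function name are illustrative, not taken from the original analysis.

```python
# Peak FP16 Tensor Core throughput (TFLOPS) and HBM bandwidth (TB/s),
# copied from the figures quoted in the text above.
GPUS = {
    "V100": (112.0, 0.90),
    "A100": (312.0, 2.04),
    "H100": (990.0, 3.35),
    "B100": (1800.0, 8.00),
}


def flops_per_byte(tflops: float, tb_per_s: float) -> float:
    """FLOPs the chip can execute in the time it takes to read one byte from HBM."""
    # Both units carry the same 1e12 prefix, so it cancels out.
    return tflops / tb_per_s


for name, (tflops, bandwidth) in GPUS.items():
    print(f"{name}: ~{flops_per_byte(tflops, bandwidth):.0f} FLOPS/Byte")
```

A kernel that performs fewer operations per byte loaded than this ratio cannot reach peak throughput, however good its instruction schedule is.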

Data movement became a hardware pipeline

Volta still looked like classic CUDA: threads computed addresses, loaded from global memory, staged through registers, wrote into shared memory, then synchronized. Ampere added cp.async, letting data move from global memory to shared memory without going through registers. Hopper pushed the model further with TMA, where hardware handles tile-level address generation, strides, boundaries, and background movement. Blackwell adds TMEM, a dedicated tensor memory layer, and makes the kernel look even less like a normal thread program.
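To see why background movement matters, here is a toy timing model in Python. It is only a sketch: the per-tile copy and compute costs are made-up units, and it assumes at least two buffers so that a copy never waits for a free slot.

```python
def synchronous_time(num_tiles: int, copy: float, compute: float) -> float:
    """Each tile is loaded, then computed, with no overlap (classic load-sync-compute)."""
    return num_tiles * (copy + compute)


def pipelined_time(num_tiles: int, copy: float, compute: float) -> float:
    """The copy of tile i+1 runs in the background while tile i is computed."""
    if num_tiles == 0:
        return 0.0
    # The first copy is fully exposed, every later step costs the slower of the
    # two stages, and the last compute has nothing left to overlap with.
    return copy + (num_tiles - 1) * max(copy, compute) + compute


# Hypothetical costs: 8 tiles, copy and compute each take 1 time unit.
print(synchronous_time(8, 1.0, 1.0))  # 16.0
print(pipelined_time(8, 1.0, 1.0))    # 9.0
```

In this model the pipelined kernel approaches the cost of the slower stage alone, which is the effect cp.async and TMA aim for in hardware.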

Matrix instructions also changed their meaning

Volta wmma: a warp synchronously computes together.
Ampere mma.sync: more control over shape and dtype, with explicit shared-memory layouts.
Hopper wgmma.mma_async: Tensor Cores become an asynchronous compute agent.
Blackwell tcgen05.mma: accumulation can live in TMEM, a new 256KB/SM tensor storage layer.

The mental model changes from “all threads participate in this computation” to “launch work to a hardware unit, track its completion, and manage where the result lives.”
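The 256KB/SM figure can be turned into a rough capacity check. The sketch below assumes TMEM is organized as 128 lanes by 512 columns of 32-bit cells; that layout is an assumption not stated in the text, though it multiplies out to the quoted 256KB. The function name is illustrative.

```python
TMEM_BYTES_PER_SM = 256 * 1024   # "256KB/SM" from the text
TMEM_LANES = 128                 # assumption: rows of the TMEM array
TMEM_COLUMNS = 512               # assumption: columns of the TMEM array
CELL_BYTES = 4                   # assumption: one 32-bit value per cell


def tmem_fraction(rows: int, cols: int, bytes_per_element: int = 4) -> float:
    """Share of one SM's TMEM taken by a rows x cols accumulator tile."""
    return rows * cols * bytes_per_element / TMEM_BYTES_PER_SM


# The assumed layout accounts for exactly the quoted capacity.
print(TMEM_LANES * TMEM_COLUMNS * CELL_BYTES == TMEM_BYTES_PER_SM)  # True
# A 128x256 FP32 accumulator would take half of TMEM.
print(tmem_fraction(128, 256))  # 0.5
```

Under these assumptions two such accumulators fit at once, so one could be drained while the other is being filled.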

Synchronization is no longer just “wait for threads”

Traditional CUDA synchronization was mostly control-flow synchronization: __syncthreads() means everyone in the block has arrived. But with TMA, WGMMA, and Blackwell async MMA, the important question is different: who produced this data, who consumes it, which async proxy owns the operation, what signal proves the data is ready, and when the buffer can be reused. In modern kernels, synchronization is becoming fine-grained data dependency management.
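The “what signal proves the data is ready” and “when the buffer can be reused” questions map onto a pair of per-slot signals. The following Python sketch imitates that pattern with ordinary threads and semaphores. It is an analogy, not GPU code: `run_pipeline`, the two-stage ring buffer, and the fake tile contents are all illustrative assumptions.

```python
import threading


def run_pipeline(num_tiles: int, num_stages: int = 2) -> list:
    """Toy producer/consumer ring buffer guarded by per-slot full/empty signals."""
    buffers = [None] * num_stages
    # full[s]: "data in slot s is ready"; empty[s]: "slot s may be overwritten".
    full = [threading.Semaphore(0) for _ in range(num_stages)]
    empty = [threading.Semaphore(1) for _ in range(num_stages)]
    results = []

    def producer():
        for tile in range(num_tiles):
            slot = tile % num_stages
            empty[slot].acquire()           # wait until the slot can be reused
            buffers[slot] = [tile] * 4      # stands in for an async tile copy
            full[slot].release()            # signal that the data is ready

    def consumer():
        for tile in range(num_tiles):
            slot = tile % num_stages
            full[slot].acquire()            # wait for proof the data has landed
            results.append(sum(buffers[slot]))  # stands in for the matrix op
            empty[slot].release()           # hand the slot back to the producer

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


print(run_pipeline(num_tiles=6))  # [0, 4, 8, 12, 16, 20]
```

No global barrier appears anywhere: each wait names one buffer and one direction of the dependency, which is the fine-grained style the text describes.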

Low precision is not just changing dtype

FP8 and FP4 are often described as smaller types. But on Blackwell, low precision becomes a whole constraint system. For block-scaled MMA, the compiler must jointly reason about dtype, scale vector size, scale tensor layout, operand packing, alignment, physical layout, accumulator type, tile shape, dispatch policy, and which tcgen05 instruction is legal. So instruction selection is no longer a late backend peephole. It is tied to dtype, packing, scale placement, layout, accumulator choice, and epilogue design.
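A small Python sketch shows why a block-scaled 4-bit format involves more than a dtype. It assumes the E2M1 value grid for FP4, a power-of-two scale per block, and low-nibble-first packing. The scale-selection rule and packing order are simplifications chosen for illustration, not the rules of any particular tcgen05 instruction.

```python
# Magnitudes representable by a 4-bit E2M1 float; list index == 3-bit magnitude code.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def block_scale(block):
    """Smallest power-of-two scale that maps the block's largest magnitude into the grid."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    scale = 1.0
    while amax / scale > E2M1_GRID[-1]:
        scale *= 2.0
    while amax / (scale / 2.0) <= E2M1_GRID[-1]:
        scale /= 2.0
    return scale


def quantize_block(block):
    """Return (scale, 4-bit codes) for one block; code = sign bit << 3 | magnitude index."""
    scale = block_scale(block)
    codes = []
    for x in block:
        target = abs(x) / scale
        index = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - target))
        codes.append((8 if x < 0 else 0) | index)
    return scale, codes


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (packing order is an assumption)."""
    padded = codes + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


def dequantize_block(scale, codes):
    """Invert quantize_block up to rounding error."""
    return [(-1.0 if c & 8 else 1.0) * E2M1_GRID[c & 7] * scale for c in codes]


scale, codes = quantize_block([12.0, -1.0, 0.0, 3.1])
print(scale, codes, pack_nibbles(codes).hex(), dequantize_block(scale, codes))
```

Even in this toy, the block size, the scale's own format, and the nibble order all have to agree between whoever writes the operand and whoever reads it.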

FlashAttention shows the new programming model clearly

FlashAttention-3 (FA3) on H100 treats the GPU like a set of schedulable hardware units: one warpgroup runs WGMMA on Tensor Cores, another overlaps softmax on CUDA Cores, and TMA prefetches the next tile in the background. This “ping-pong scheduling” raised utilization sharply: FlashAttention-2 used about 35% of H100 peak, FA3 FP16 reached 740 TFLOPS (about 75% of the 990 TFLOPS peak), and FA3 FP8 approached 1.2 PFLOPS.

FlashMLA shows the same lesson from another angle. MLA decoding can require 32,768 32-bit registers for one 64×512 output tile, while an SM has only 65,536 registers. DeepSeek’s “seesaw scheduling” splits the output, alternates warpgroups, overlaps Tensor Core and CUDA Core work, and starts GEMM as soon as the first fine-grained TMA copy arrives. Modern kernel optimization is no longer just choosing an algorithm. It is designing a schedule for hardware units.
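The register pressure behind the FlashMLA numbers can be checked with a few lines of Python. The tile size and register-file size come from the text. The 128-thread warpgroup and the 255-register per-thread limit are standard CUDA figures. Splitting the tile evenly across two warpgroups is an assumption made for illustration, since the text only says the output is split.

```python
REGISTERS_PER_SM = 65_536         # 32-bit registers per SM, from the text
TILE_ROWS, TILE_COLS = 64, 512    # MLA decoding output tile, from the text
THREADS_PER_WARPGROUP = 128       # 4 warps x 32 threads
MAX_REGISTERS_PER_THREAD = 255    # per-thread register limit in CUDA


def registers_per_thread(rows: int, cols: int, warpgroups: int) -> int:
    """32-bit registers each thread needs to hold its share of an FP32 tile."""
    return rows * cols // (warpgroups * THREADS_PER_WARPGROUP)


tile_registers = TILE_ROWS * TILE_COLS
print(tile_registers, tile_registers / REGISTERS_PER_SM)  # 32768 0.5

for warpgroups in (1, 2):
    need = registers_per_thread(TILE_ROWS, TILE_COLS, warpgroups)
    print(warpgroups, need, need <= MAX_REGISTERS_PER_THREAD)
```

Held by a single warpgroup, the accumulators alone would need 256 registers per thread, one more than the limit. Spread over two warpgroups the share drops to 128, which leaves room for everything else the kernel keeps in registers.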

What this means for compilers

A compiler or DSL for modern AI chips needs to describe at least five things: tile shape, memory movement, physical layout, async producer-consumer dependencies, and the schedule of copy, compute, sync, and writeback. Without these, correctness is hard to verify and performance is hard to explain. This also means a unified frontend does not magically produce optimal backend code. Different hardware still needs specialized schedules, synchronization strategies, layouts, and resource models.

The biggest change from Volta to Blackwell is not a faster instruction. It is a new way to think about a GPU kernel: from synchronous thread cooperation to asynchronous dataflow across hardware components.
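As a rough illustration of what describing those five things could look like, here is a hypothetical Python data model. Every name in it (`Step`, `KernelPlan`, the buffer and layout labels) is invented for this sketch and does not correspond to any existing compiler or DSL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str            # e.g. "copy_A", "mma", "writeback"
    unit: str            # hardware unit label, e.g. "TMA" or "TensorCore"
    reads: tuple = ()    # buffers this step consumes
    writes: tuple = ()   # buffers this step produces


@dataclass(frozen=True)
class KernelPlan:
    tile_shape: tuple    # (M, N, K) of one tile
    layouts: dict        # buffer name -> physical layout label
    inputs: tuple        # buffers that exist before the kernel starts
    schedule: tuple      # Steps in issue order

    def unproduced_reads(self):
        """List (step, buffer) pairs where a step reads data no earlier step produced."""
        ready = set(self.inputs)
        problems = []
        for step in self.schedule:
            problems += [(step.name, buf) for buf in step.reads if buf not in ready]
            ready.update(step.writes)
        return problems


plan = KernelPlan(
    tile_shape=(128, 128, 64),
    layouts={"smem_A": "swizzle_128B", "smem_B": "swizzle_128B", "acc": "row_major"},
    inputs=("gmem_A", "gmem_B"),
    schedule=(
        Step("copy_A", "TMA", reads=("gmem_A",), writes=("smem_A",)),
        Step("copy_B", "TMA", reads=("gmem_B",), writes=("smem_B",)),
        Step("mma", "TensorCore", reads=("smem_A", "smem_B"), writes=("acc",)),
        Step("writeback", "CUDACore", reads=("acc",), writes=("gmem_C",)),
    ),
)
print(plan.unproduced_reads())  # []
```

Once producers and consumers are explicit, a missing dependency becomes a checkable property rather than a race found at runtime. Dropping the `copy_B` step from the schedule makes the check report `("mma", "smem_B")`.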

Full analysis: https://zhuanlan.zhihu.com/p/2054548076421978077…

#GPU #CUDA #AIInfra #Compiler #HPC #Blackwell #FlashAttention #Triton

Similar Articles

@levidiamode: 158/365 of GPU Programming I think I understand the high level differences between the FlashAttention 2, 3 and 4 forwar…

@snowboat84: https://x.com/snowboat84/status/2061962883651731602

@vivekgalatage: It's super interesting to know the system architecture of the TPUs. https://henryhmko.github.io/posts/tpu/tpu.html…

@HanGuo97: Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao! PDF: https://a…

@levidiamode: Day 138/365 of GPU Programming One of my favorite lectures I've watched this year is Stanford's CS336 lecture 7 on GPU …
