@levidiamode: 158/365 of GPU Programming I think I understand the high level differences between the FlashAttention 2, 3 and 4 forwar…
Summary
The author documents their progress in learning GPU programming, focusing on understanding the high-level differences between FlashAttention 2, 3, and 4 forward passes, and lists several low-level concepts they need to explore further.
View Cached Full Text
Cached at: 06/10/26, 11:58 PM
158/365 of GPU Programming
I think I understand the high level differences between the FlashAttention 2, 3 and 4 forward passes now but have yet to grasp the lower level details of each algorithm and the backward passes.
Need to spend some time learning more about cooperative thread arrays, DSMEM, emulation of the exp function, rowmax/rowsum, warp partitioning and specializatio, WGMMA, asynchrony, producer/consumer pipelines, etc
levi (@levidiamode): 157/365 of GPU Programming
Another FlashAttention4 resource that’s been really helpful for me is the talk @charles_irl gave last year on GPU Mode (basically the lecture version of We reverse-engineered Flash Attention 4 blog post which is awesome as well) about FA4’s code and
Similar Articles
@levidiamode: 157/365 of GPU Programming Another FlashAttention4 resource that's been really helpful for me is the talk @charles_irl …
A daily GPU programming thread highlights a talk by Charles_irl that reverse-engineers FlashAttention4 code before the paper release, praising the Modal team's deep code dissection and inferences about the forward pass.
@levidiamode: Day 138/365 of GPU Programming One of my favorite lectures I've watched this year is Stanford's CS336 lecture 7 on GPU …
A learner shares enthusiasm for Stanford CS336 lecture 7 on GPU parallelism, which covers fundamental operations and connects them to multi-GPU setups and parallelism techniques like tensor, data, and pipeline parallelism.
@charles_irl: Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been …
A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
Proposes Lookahead Sparse Attention with a Neural Memory Indexer on DeepSeek-V4, reducing GPU memory usage to ~13.5% of full-context baseline while maintaining or slightly improving accuracy.
@charles_irl: A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads. Inference is different from t…
Explains that inference kernels differ from training, with Flash Attention 4 focusing on changing parallelism across KV and supporting small irregular loads.