@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

X AI KOLs Following 06/11/26, 07:04 PM News

parallelism cuda tensor-core kernel-programming ai-optimization flash-attention

Summary

@charles_irl notes that rewriting the parallelism of the FA4 (FlashAttention-4) kernel is a big move, and hopes that new tile programming models from @blelbach's team could make it even faster than is possible with CuTe DSL.

Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very tile-pilled, Tensor Core-maxxing kernel. We'd love to write (and repeatedly rewrite) such kernels with @blelbach and team's new tile programming models. https://t.co/91fKeCAYYt

Original Article

Cached at: 06/12/26, 10:58 AM

Similar Articles

@charles_irl: Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been …

X AI KOLs Following

A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.

@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…

X AI KOLs Following

A tweet showcasing a CuTe DSL kernel sample that uses layouts to express transposition, part of the FlashAttention-4 kernel.

@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…

X AI KOLs Following

A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.

@levidiamode: 157/365 of GPU Programming Another FlashAttention4 resource that's been really helpful for me is the talk @charles_irl …

X AI KOLs Following

A daily GPU programming thread highlights a talk by @charles_irl that reverse-engineers the FlashAttention-4 code before the paper's release, praising the Modal team's deep code dissection and inferences about the forward pass.

@charles_irl: A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads. Inference is different from t…

X AI KOLs Following

Explains that inference kernels differ from training kernels, with the FlashAttention-4 work focusing on changing parallelism across KV and supporting small, irregular loads.
