@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…


Summary

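The author notes that rewriting a kernel's parallelism is a major change and hopes that new tile programming models will make it even faster than is possible with CuTe DSL, citing FA4 (FlashAttention 4) as a tile-centric kernel that leans heavily on Tensor Cores.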

Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very tile-pilled, Tensor Core-maxxing kernel. We'd love to write (and repeatedly rewrite) such kernels with @blelbach and team's new tile programming models. https://t.co/91fKeCAYYt

Cached at: 06/12/26, 10:58 AM

