Tag
A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.
A daily GPU programming thread highlights a talk by Charles_irl that reverse-engineers FlashAttention4 code before the paper release, praising the Modal team's deep code dissection and inferences about the forward pass.