A tweet showcases a CuTe DSL kernel sample, taken from the FlashAttention-4 kernel, that uses layouts to express transposition.
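To illustrate what "using layouts to express transposition" can mean, here is a minimal sketch in plain Python. It is not CuTe DSL code and not the sample from the tweet; `make_layout` and `transpose_layout` are hypothetical helpers. It assumes a layout is a (shape, stride) pair that maps a coordinate to a linear offset, so a transpose amounts to swapping the two modes while the data stays where it is.

```python
# Hypothetical helpers in plain Python; this is not the CuTe DSL API.
def make_layout(shape, stride):
    """Return a function mapping a (row, col) coordinate to a linear offset."""
    def offset(i, j):
        assert 0 <= i < shape[0] and 0 <= j < shape[1]
        return i * stride[0] + j * stride[1]
    return offset


def transpose_layout(shape, stride):
    """Transpose by swapping the modes of shape and stride; no data is moved."""
    return (shape[1], shape[0]), (stride[1], stride[0])


# A 2x3 row-major matrix stored in a flat buffer.
data = [0, 1, 2, 3, 4, 5]
shape, stride = (2, 3), (3, 1)
a = make_layout(shape, stride)

# The transposed view is 3x2 with strides (1, 3) over the same buffer.
t_shape, t_stride = transpose_layout(shape, stride)
a_t = make_layout(t_shape, t_stride)

# Reading the buffer through the transposed layout yields the transpose.
transposed = [
    [data[a_t(i, j)] for j in range(t_shape[1])]
    for i in range(t_shape[0])
]
```

The buffer `data` is never rewritten; only the index mapping changes, which is why a layout-based transpose can cost nothing at the point where it is expressed.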
Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves a 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, and up to a 3.5x speedup when fused with rotary embeddings.
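For readers unfamiliar with the term, block-diagonal self-attention can be sketched as a pure-Python reference. This only illustrates the semantics and is not the TLX kernel: the function names and the `block_lengths` parameter are assumptions made for this example, and it assumes each token attends only to tokens in its own contiguous block.

```python
import math


def block_ids(block_lengths):
    """Assign each token position the index of the block it belongs to."""
    ids = []
    for b, n in enumerate(block_lengths):
        ids.extend([b] * n)
    return ids


def block_diagonal_attention(q, k, v, block_lengths):
    """Reference attention in which a query only sees keys from its own block."""
    ids = block_ids(block_lengths)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Keys outside the query's block are skipped entirely.
        cols = [j for j in range(len(k)) if ids[j] == ids[i]]
        scores = [scale * sum(x * y for x, y in zip(qi, k[j])) for j in cols]
        # Numerically stable softmax over the in-block scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, cols)) / total
            for d in range(len(v[0]))
        ])
    return out


# Three tokens split into blocks of length 2 and 1. Zero queries and keys
# give uniform weights, so each output is the mean of the values in its block.
q = [[0.0, 0.0]] * 3
k = [[0.0, 0.0]] * 3
v = [[1.0], [3.0], [10.0]]
result = block_diagonal_attention(q, k, v, [2, 1])
```

Because off-block scores are never computed, the work scales with the sum of squared block lengths rather than the square of the full sequence length, which is the structure a specialized kernel can exploit.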
CODA introduces a GPU kernel abstraction that rewrites transformer computations as GEMM-plus-epilogue programs, reducing memory-bound operations and improving efficiency in training.
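The GEMM-plus-epilogue idea can be illustrated with a small pure-Python sketch. It is not CODA's interface: `gemm_with_epilogue` and `bias_relu` are hypothetical names, and the example assumes the epilogue is an elementwise function applied to each accumulator before the result is written out, so that no separate pass over the output is needed.

```python
def gemm_with_epilogue(a, b, epilogue):
    """Multiply a (M x K) by b (K x N), applying `epilogue` to each accumulator.

    The epilogue runs while the accumulator is still "in registers", so the
    elementwise work needs no extra read and write of the output matrix.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            out[i][j] = epilogue(acc, i, j)
    return out


# Hypothetical epilogue: add a per-column bias, then apply ReLU.
bias = [1.0, -10.0]


def bias_relu(acc, i, j):
    return max(acc + bias[j], 0.0)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]  # identity, so a @ b equals a
fused = gemm_with_epilogue(a, b, bias_relu)
```

Run unfused, the bias add and the ReLU would each read and write the full output, which is the kind of memory-bound traffic that folding them into the GEMM avoids.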
MoonshotAI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to 2.22x speedup over Triton on H20 GPUs.