gpu-kernel

#gpu-kernel

@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…

X AI KOLs Following ↗ · 2026-05-26

A tweet showcasing a sample CuTe DSL kernel that uses layouts to express transposition. CuTe DSL is used in, among others, the FlashAttention-4 kernel.

#gpu-kernel

@PyTorch: PyTorch member Meta just open-sourced a GPU kernel that makes attention 2.3x faster on NVIDIA Blackwell. TLX Block Atte…

X AI KOLs Following ↗ · 2026-05-26

Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, with up to 3.5x speedup when fused with rotary embeddings.

#gpu-kernel

@HanGuo97: Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao! PDF: https://a…

X AI KOLs Following ↗ · 2026-05-21

CODA introduces a GPU kernel abstraction that rewrites transformer computations as GEMM-plus-epilogue programs, reducing memory-bound operations and improving efficiency in training.

#gpu-kernel

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Reddit r/LocalLLaMA ↗ · 2026-04-22

MoonshotAI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to a 2.22x speedup over the Triton baseline on H20 GPUs.

