attention-kernels

New set of FP4 attention kernels for B300, achieving up to 1.69x speedup over FA4

Reddit r/LocalLLaMA ↗ · 2026-07-14

The FastVideo team has released new FP4 attention kernels for B300 that achieve up to a 1.69x speedup over FlashAttention-4 (FA4).

Exploring FlashAttention-3/4 optimizations on RTX GPUs

Reddit r/LocalLLaMA ↗ · 2026-07-09

This article examines whether FlashAttention-3/4 optimizations benefit RTX GPUs and concludes that FlashAttention-2 (FA-2) is the ceiling because of hardware limitations on consumer cards.

@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…

X AI KOLs Following ↗ · 2026-05-21

A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.

@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…

X AI KOLs Following ↗ · 2026-04-21

Moonshot AI releases FlashKDA, an open-source CUTLASS-based implementation of Kimi Delta Attention kernels that delivers 1.72×–2.22× prefill speedup on H20 GPUs.
