Tag
Proposes mechanism-driven monitors that detect LLM training instability preemptively, deriving internal signals from low-precision flash attention and MoE routers to flag problems thousands of steps before the loss diverges.
A new book from CMU's Machine Learning Systems course teaches modern GPU programming for ML systems, covering Blackwell architecture, GEMM, and FlashAttention using the TIRx Python DSL.
NVIDIA Research publishes a technical blog post examining KV cache compression techniques and their infrastructure problems, including how FlashAttention and paged attention create practical obstacles for production deployment of long-context LLMs, with a proposed geometric solution using RoPE.
Discussion about rewriting parallelism to improve kernel performance using CuTe DSL and tile programming models for the FA4 (FlashAttention 4) kernel.
Explains that inference kernels differ from training kernels, with Flash Attention 4 focusing on changing parallelism across KV and supporting small, irregular loads.
Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 in decoding. Its effectiveness depends on the optimizer, working with Muon but not AdamW, highlighting the role of optimizer geometry.
The author documents their progress in learning GPU programming, focusing on understanding the high-level differences between FlashAttention 2, 3, and 4 forward passes, and lists several low-level concepts they need to explore further.
This paper analyzes precision loss in FP8 attention due to the attention sink phenomenon when casting the softmax output to FP8 (E4M3). It shows that forward KV iteration causes underflow of non-sink attention values, and proposes reverse iteration and a static scaling factor S=256 to eliminate underflow, achieving 3-10x MSE improvement.
Kazuki Fujii announces the first installment of a blog series on CUDA Programming basics, written in an accessible way, essential for understanding FlashAttention and hardware-aware acceleration techniques.
A new packed16 K technique for llama.cpp on RDNA3 GPUs reduces KV cache VRAM by 47% compared to Vulkan fp16, using int8 packing and native dot4 instructions to maintain fp16-quality K values with minimal KLD loss.
This pull request for the llama.cpp inference engine switches Flash Attention to an f16 mask to reduce VRAM usage.
A tweet showcasing a CuTe DSL kernel sample that uses layouts to express transposition, part of the FlashAttention-4 kernel.
This article critically analyzes the claims and timeline of the subQ long-context AI technique, highlighting discrepancies and walkbacks from the original announcement.
A custom binary workaround enables flash attention on AMD RDNA2 GPUs for llama.cpp, doubling inference speed (70-80 tok/s, whereas the stock build crashes). Only confirmed working with Qwen3.6 35B/27B.
Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.
Lighthouse Attention is a selection-based hierarchical attention mechanism that accelerates long-context pretraining by running forward+backward passes ~17× faster at 512K context and delivering 1.4–1.7× end-to-end speedup at 98K context, validated with Llama-3 530M on 50B tokens.
llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.
A lecture on LLM deployment techniques covering AWQ, vLLM, FlashAttention, quantization, and activation smoothing for efficient serving.
Meta's In-Kernel Broadcast Optimization (IKBO) eliminates redundant user-embedding broadcast in RecSys inference via kernel-model-system co-design, delivering up to 2/3 latency reduction and ~4x speedup on H100 GPUs, and serving as the backbone for the Meta Adaptive Ranking Model.
A highly optimized version of OpenAI's Whisper Large v3 using Transformers, Optimum, and Flash Attention 2, capable of transcribing 150 minutes of audio in under 2 minutes on Replicate.