flash-attention

#flash-attention

Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

arXiv cs.CL · 20h ago

Proposes mechanism-driven monitors for preemptive detection of LLM training instability by deriving internal signals from low-precision flash attention and MoE routers, enabling detection thousands of steps before loss divergence.

Modern GPU Programming for MLSys

Hacker News Top · 6d ago

A new book from CMU's Machine Learning Systems course teaches modern GPU programming for ML systems, covering Blackwell architecture, GEMM, and FlashAttention using the TIRx Python DSL.

@yukangchen_: We are excited to share a new technical article “KV Cache Compression and Its Infra Problems.” https://research.nvidia.…

X AI KOLs Timeline · 2026-06-16

NVIDIA Research publishes a technical blog post examining KV cache compression techniques and their infrastructure problems, including how FlashAttention and paged attention create practical obstacles for production deployment of long-context LLMs, with a proposed geometric solution using RoPE.

@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

X AI KOLs Following · 2026-06-11

Discussion about rewriting parallelism to improve kernel performance using CuTe DSL and tile programming models for the FA4 (FlashAttention 4) kernel.

@charles_irl: A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads. Inference is different from t…

X AI KOLs Following · 2026-06-11

Explains that inference kernels differ from training, with Flash Attention 4 focusing on changing parallelism across KV and supporting small irregular loads.

@maximelabonne: Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode…

X AI KOLs Following · 2026-06-10

Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 in decoding. Its effectiveness depends on the optimizer, working with Muon but not AdamW, highlighting the role of optimizer geometry.

@levidiamode: 158/365 of GPU Programming I think I understand the high level differences between the FlashAttention 2, 3 and 4 forwar…

X AI KOLs Timeline · 2026-06-10

The author documents their progress in learning GPU programming, focusing on understanding the high-level differences between FlashAttention 2, 3, and 4 forward passes, and lists several low-level concepts they need to explore further.

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

arXiv cs.AI · 2026-06-08

This paper analyzes precision loss in FP8 attention due to the attention sink phenomenon when casting the softmax output to FP8 (E4M3). It shows that forward KV iteration causes underflow of non-sink attention values, and proposes reverse iteration and a static scaling factor S=256 to eliminate underflow, achieving 3-10x MSE improvement.
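
The underflow described in this summary can be illustrated with a toy model. The sketch below is not the paper's kernel: it assumes the OCP "E4M3FN" layout (3 mantissa bits, smallest subnormal 2^-9, largest finite value 448), uses a hypothetical helper `quantize_e4m3` that rounds a non-negative Python float to the nearest E4M3 value, and picks illustrative probabilities for a sink token and a non-sink token. Reverse KV iteration is not modeled here; only the effect of a static scale S = 2^8 applied before the cast is shown.

```python
import math

# Assumed format constants (OCP FP8 E4M3, "FN" variant).
E4M3_MAX = 448.0      # largest finite value
E4M3_MIN_EXP = -6     # exponent of the smallest normal value
E4M3_MANT_BITS = 3    # explicit mantissa bits


def quantize_e4m3(x: float) -> float:
    """Round a non-negative float to the nearest E4M3 value (toy model)."""
    if x <= 0.0:
        return 0.0
    if x >= E4M3_MAX:
        return E4M3_MAX  # saturate instead of overflowing
    # floor(log2(x)), clamped so subnormals share the smallest exponent.
    exp = max(math.frexp(x)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - E4M3_MANT_BITS)  # spacing of representable values
    return round(x / step) * step         # Python rounds ties to even


def cast_with_scale(p: float, scale: float) -> float:
    """Scale, cast to E4M3, then undo the scale (as a later matmul could)."""
    return quantize_e4m3(p * scale) / scale


# Illustrative softmax outputs: a dominant sink and a small non-sink value.
p_sink, p_tail = 0.99, 1e-4
S = 2.0 ** 8

print(cast_with_scale(p_tail, 1.0))  # 0.0: the non-sink value underflows
print(cast_with_scale(p_tail, S))    # nonzero: the scaled value is representable
print(cast_with_scale(p_sink, S))    # the sink still fits, since 256 <= 448
```

Because softmax outputs lie in [0, 1], scaling by 256 keeps the largest value below the assumed E4M3 maximum of 448 while shifting the underflow threshold down by eight binades.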

@kazukifujii: Tech Blog Release Day5 This is the first installment of a blog series that explains CUDA Programming from the basics, w…

X AI KOLs Timeline · 2026-06-04

Kazuki Fujii announces the first installment of a blog series on CUDA Programming basics, written in an accessible way, essential for understanding FlashAttention and hardware-aware acceleration techniques.

Flash Attention for llama.cpp on RDNA3: 47% less KV VRAM than Vulkan f16 K, KLD almost lossless on F16 K / q4_0 V. Part 1.

Reddit r/LocalLLaMA · 2026-05-31

A new packed16 K technique for llama.cpp on RDNA3 GPUs reduces KV cache VRAM by 47% compared to Vulkan fp16, using int8 packing and native dot4 instructions to maintain fp16-quality K values with minimal KLD loss.

llama: use f16 mask for FA to save VRAM by am17an · Pull Request #23764 · ggml-org/llama.cpp

Reddit r/LocalLLaMA · 2026-05-29

This pull request for the llama.cpp inference engine switches Flash Attention to an f16 mask to reduce VRAM usage.
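
A rough sense of the saving comes from simple arithmetic. The sketch below assumes a dense mask with one element per (KV position, query token) pair and assumes the mask was previously held as 32-bit floats; neither detail is stated in the summary, and the dimensions are hypothetical.

```python
def mask_bytes(n_kv: int, n_tokens: int, bytes_per_elem: int) -> int:
    """Size of a dense attention mask with one element per (KV, token) pair."""
    return n_kv * n_tokens * bytes_per_elem


# Hypothetical sizes: 32K cached positions, a 512-token batch.
n_kv, n_tokens = 32768, 512

f32_size = mask_bytes(n_kv, n_tokens, 4)  # assumed previous format
f16_size = mask_bytes(n_kv, n_tokens, 2)  # half-precision mask

print(f"f32 mask: {f32_size / 2**20:.0f} MiB")
print(f"f16 mask: {f16_size / 2**20:.0f} MiB")
```

Under these assumptions the mask shrinks by half; the actual saving in the pull request depends on how llama.cpp allocates the mask.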

@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…

X AI KOLs Following · 2026-05-26

A tweet showcasing a sample CuTe DSL kernel that uses layouts to express transposition. CuTe DSL is used in, among others, the FlashAttention-4 kernel.

@no_stp_on_snek: https://subq.mildlyconcerning.com

X AI KOLs Timeline · 2026-05-26

This article critically analyzes the claims and timeline of the subQ long-context AI technique, highlighting discrepancies and walkbacks from the original announcement.

RDNA2 flash attention isn’t enabled stock, I enabled it with this build and doubled my speed

Reddit r/LocalLLaMA · 2026-05-19

A custom llama.cpp build enables flash attention on AMD RDNA2 GPUs, where it is not enabled in the stock build. The poster reports doubled inference speed (70-80 tok/s, versus a crash on the stock build). Only confirmed working with Qwen3.6 35B/27B.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv cs.LG · 2026-05-18

Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.

Lighthouse Attention (11 minute read)

TLDR AI · 2026-05-18

Lighthouse Attention is a selection-based hierarchical attention mechanism that accelerates long-context pretraining by running forward+backward passes ~17× faster at 512K context and delivering 1.4–1.7× end-to-end speedup at 98K context, validated with Llama-3 530M on 50B tokens.

RDNA3 Flash Attention fix just dropped by llama.cpp b9158

Reddit r/LocalLLaMA · 2026-05-15

llama.cpp b9158 has been released with a fix for Flash Attention on RDNA3 GPUs, improving performance for AMD users.

@ickma2311: Efficient AI Lecture 13: LLM Deployment Techniques The lecture helped me understand AWQ, vLLM, and FlashAttention very …

X AI KOLs Timeline · 2026-05-13

A lecture on LLM deployment techniques covering AWQ, vLLM, FlashAttention, quantization, and activation smoothing for efficient serving.

Meta's Optimized RecSys Inference (58 minute read)

TLDR AI · 2026-05-08

Meta's In-Kernel Broadcast Optimization (IKBO) eliminates redundant user-embedding broadcast in RecSys inference via kernel-model-system co-design, delivering up to 2/3 latency reduction and ~4x speedup on H100 GPUs, and serving as the backbone for the Meta Adaptive Ranking Model.

vaibhavs10/incredibly-fast-whisper

Replicate Explore · 2026-05-08

A highly optimized version of OpenAI's Whisper Large v3 using Transformers, Optimum, and Flash Attention 2, capable of transcribing 150 minutes of audio in under 2 minutes on Replicate.
