@derangineer: the goats in the game
Summary
Charles Frye announces a blog post detailing his team's contributions to FlashAttention-4 (FA4), focusing on inference performance improvements that have been upstreamed.
Cached at: 06/12/26, 06:54 AM
the goats in the game
Charles 🎉 Frye (@charles_irl): Last fall, we shared our deep dive on FA4 internals.
But we didn’t stop at grokking the kernel.
Since then, we’ve been developing improvements for inference performance and upstreaming them.
This blog post explains those contributions.
Similar Articles
@charles_irl: Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been …
A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.
@satvikgari: Wake up the goat posted
A progress update on reinventing Groq's LPU, with a redesigned vector execution module to better support overlap operations and self-attention.
@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…
A discussion about rewriting parallelism to improve kernel performance using CuTe DSL and tile programming models for the FlashAttention-4 (FA4) kernel.
@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…
A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.
Claude Mythos, Deepseek v4, HappyHorse, Meta’s new AI, realtime video games: AI NEWS
Anthropic unveils a withheld Claude Mythos model that autonomously finds thousands of 0-days, ZAI open-sources the 1.5 TB GLM-5.1 that tops open-weight benchmarks, Alibaba’s unreleased HappyHorse video model hits #1 on public leaderboards, and Deepseek teases an “Expert Mode” v4 preview.