@derangineer: the goats in the game

X AI KOLs Following 06/11/26, 10:15 PM News

inference-performance open-source fa4 kernel deep-dive contributions

Summary

Charles Frye announces a blog post explaining his team's contributions to FA4 (FlashAttention-4): inference performance improvements that they have been developing and upstreaming since last fall's deep dive on the kernel's internals.

Original Article

Cached at: 06/12/26, 06:54 AM

the goats in the game

Charles 🎉 Frye (@charles_irl): Last fall, we shared our deep dive on FA4 internals.

But we didn’t stop at grokking the kernel.

Since then, we’ve been developing improvements for inference performance and upstreaming them.

This blog post explains those contributions.

Similar Articles

@charles_irl: Last fall, we shared our deep dive on FA4 internals. But we didn't stop at grokking the kernel. Since then, we've been …

X AI KOLs Following

A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.

@satvikgari: Wake up the goat posted

X AI KOLs Timeline

A progress update on reinventing Groq's LPU, with a redesigned vector execution module to better support overlap operations and self-attention.

@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

X AI KOLs Following

A discussion of rewriting parallelism in the FA4 (FlashAttention-4) kernel to improve performance, covering CuTe DSL and tile programming models.

@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…

X AI KOLs Following

A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.

Claude Mythos, Deepseek v4, HappyHorse, Meta’s new AI, realtime video games: AI NEWS

YouTube AI Channels

Anthropic unveils a withheld Claude Mythos model that autonomously finds thousands of 0-days, ZAI open-sources the 1.5 TB GLM-5.1 that tops open-weight benchmarks, Alibaba’s unreleased HappyHorse video model hits #1 on public leaderboards, and Deepseek teases an “Expert Mode” v4 preview.
