@levidiamode: 163/365 of GPU Programming Looking at a few different agentic GPU kernel optimization systems today. The two I'm most i…
Summary
A tweet discussing two agentic GPU kernel optimization systems: Auto GPU Kernel by @dogacel0, which won the DeepSeek Sparse Attention FlashInfer challenge at MLSys, and Kernel Design Agents from @songhan_mit's lab, which took 1st place in the MoE track of the same competition. The tweet highlights their different uses of subagents and Claude skills, such as the Kernel Wiki, to optimize agentic loops for GPU programming.
Cached at: 06/16/26, 01:13 AM
163/365 of GPU Programming
Looking at a few different agentic GPU kernel optimization systems today. The two I’m most interested in atm:
- @dogacel0’s Auto GPU Kernel which he used to win the DeepSeek Sparse Attention FlashInfer challenge at MLSys this year
- Kernel Design Agents out of @songhan_mit’s lab which won 1st place in the MoE track of the same competition
Really interesting to see the different uses of subagents and Claude skills like the Kernel Wiki to optimize these agentic loops for GPU programming. Some great inspiration in both for my own workflows
Links to repos:
- https://github.com/Dogacel/auto-gpu-kernel…
- https://github.com/mit-han-lab/kernel-design-agents…