Tag
The Linux 7.2 kernel merges a performance optimization for anonymous (unnamed) pipes that improves throughput by 6-48% and reduces latency by 17-33% by pre-allocating pages outside the mutex lock to avoid contention.
INT21 announced PTX Kernel Factory, a self-improving agent swarm that autonomously generates expert-level PTX GPU kernels, with open-source proof-of-concept implementations and beta access.
A tweet discusses two agentic GPU kernel optimization systems: Auto GPU Kernel by @dogacel0 and Kernel Design Agents from @songhan_mit's lab, both winners at the MLSys Sparse Attention FlashInfer competition. The thread highlights their different approaches using subagents and Claude skills for GPU programming.
Explains that inference kernels differ from training kernels, with Flash Attention 4 focusing on changing parallelism across KV and supporting small, irregular loads.
A blog post details contributions to FlashAttention-4 to improve its performance for large language model inference, especially for decode-heavy workloads, by adjusting parallelism strategies and supporting irregular memory accesses.
This paper proposes using language models as selective surrogates to optimize GPU kernel runtime, demonstrating a novel approach to performance forecasting.
Alibaba's Qwen3.7-Max model autonomously optimized a production kernel on unfamiliar T-Head PPU hardware over 35 hours, making 1,158 tool calls and achieving a 10x speedup, demonstrating sustained autonomous agentic behavior without human guidance.
A lecture on LLM deployment techniques covering AWQ, vLLM, FlashAttention, quantization, and activation smoothing for efficient serving.
Metal-Sci introduces a 10-task benchmark for optimizing scientific computing kernels on Apple Silicon, paired with an evolutionary search framework driven by large language models. The study evaluates models like Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5, demonstrating significant speedups while using out-of-distribution testing to catch silent performance regressions.
Opus 4.7 auto-generated a custom WebGPU kernel that accelerates Qwen3.5 inference up to 13× via fused LinearAttention, now shipping in Transformers.js v4.2.0.
AccelOpt is a self-improving LLM agentic system that autonomously optimizes AI accelerator kernels through iterative generation and optimization memory, achieving 49-61% peak throughput improvements on AWS Trainium while being 26x cheaper than Claude Sonnet 4.