gpu-kernel

EGG: An Expert-Guided Agent Framework for Kernel Generation

arXiv cs.AI · 6h ago

EGG is an expert-guided agent framework that decomposes GPU kernel generation into algorithmic structure design and hardware-specific tuning, using a stage-aware multi-agent collaboration mechanism. It achieves a 2.13x average speedup over PyTorch on KernelBench and real-world workloads.

@shreyansh_26: https://x.com/shreyansh_26/status/2069125463860302212

X AI KOLs Timeline · 3d ago

This post explains the Decompose-K technique for accelerating skinny, large-K matrix multiplications: split the K dimension into chunks, run the chunks as a batched matmul, and sum the partial products. It provides a PyTorch implementation and benchmarks showing significant speedups over standard torch.compile for poorly shaped matmuls.
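The split, multiply, and sum steps described in this summary can be sketched in a few lines. This is a minimal pure-Python illustration of the arithmetic only, not the post's PyTorch implementation: nested lists stand in for tensors, and the names `matmul`, `decompose_k_matmul`, and `num_splits` are hypothetical. It also assumes K divides evenly by the number of splits.

```python
# Illustrative sketch of Decompose-K on nested lists (no GPU, no PyTorch).
# All function and parameter names here are hypothetical.

def matmul(a, b):
    """Plain (M x K) @ (K x N) product on nested lists."""
    k_dim, n_dim = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(k_dim)) for j in range(n_dim)]
        for row in a
    ]


def decompose_k_matmul(a, b, num_splits):
    """Compute a @ b by splitting the shared K dimension into chunks."""
    k_dim = len(b)
    if k_dim % num_splits != 0:
        # Assumption of this sketch: K is divisible by num_splits.
        raise ValueError("K must be divisible by num_splits")
    chunk = k_dim // num_splits
    m_dim, n_dim = len(a), len(b[0])

    # Steps 1 and 2: slice K into chunks and multiply each pair of slices.
    # On a GPU these independent products would run as one batched matmul.
    partials = []
    for s in range(num_splits):
        lo, hi = s * chunk, (s + 1) * chunk
        a_chunk = [row[lo:hi] for row in a]  # shape (M, K / num_splits)
        b_chunk = b[lo:hi]                   # shape (K / num_splits, N)
        partials.append(matmul(a_chunk, b_chunk))

    # Step 3: sum the partial (M x N) products elementwise.
    return [
        [sum(p[i][j] for p in partials) for j in range(n_dim)]
        for i in range(m_dim)
    ]


if __name__ == "__main__":
    # Skinny shapes: M = 2, N = 3, K = 12, split into 4 chunks of 3.
    a = [[(i + k) % 7 for k in range(12)] for i in range(2)]
    b = [[(3 * k + j) % 5 for j in range(3)] for k in range(12)]
    assert decompose_k_matmul(a, b, num_splits=4) == matmul(a, b)
```

In PyTorch the same steps would typically map to reshaping both operands so the chunk index becomes a batch dimension, calling a batched matmul such as `torch.bmm`, and summing over that batch dimension. With floating-point inputs the changed summation order can produce results that differ slightly from a single matmul; the integer inputs above avoid that.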

@ChengleiSi: Excited to share these preliminary results on our internal autoresearch system @Recursive_SI, where we achieve SOTA on …

X AI KOLs Following · 2026-06-11

Recursive's automated AI research system achieves state-of-the-art results on NanoChat, NanoGPT Speedrun, and GPU kernel benchmarks by automating the research loop without task-specific adaptations. The team is open-sourcing artifacts for further inspection.

@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…

X AI KOLs Following · 2026-05-26

A tweet showcasing a sample kernel written in CuTe DSL that uses layouts to express transposition. CuTe DSL is used in, among others, the FlashAttention-4 kernel.

@PyTorch: PyTorch member Meta just open-sourced a GPU kernel that makes attention 2.3x faster on NVIDIA Blackwell. TLX Block Atte…

X AI KOLs Following · 2026-05-26

Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves a 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, rising to as much as 3.5x when fused with rotary embeddings.

@HanGuo97: Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao! PDF: https://a…

X AI KOLs Following · 2026-05-21

CODA introduces a GPU kernel abstraction that rewrites transformer computations as GEMM-plus-epilogue programs, reducing memory-bound operations and improving efficiency in training.

Moonshot open-sourced FlashKDA, CUTLASS kernels for Kimi Delta Attention, up to 2.22x over the Triton baseline on H20

Reddit r/LocalLLaMA · 2026-04-22

MoonshotAI released FlashKDA, open-source CUTLASS kernels for Kimi Delta Attention that deliver up to a 2.22x speedup over the Triton baseline on H20 GPUs.
