Tag
This blog post continues the series on profiling in PyTorch, exploring nn.Linear, MLP blocks, and fusion techniques that use Triton kernels to optimize performance.
A detailed guide on learning AI inference engine internals, covering serving engines like vLLM and SGLang, low-level GPU kernel programming with Triton and CUTLASS, and a sequence of mini-projects to build hands-on expertise.
Helion is a Python DSL that compiles to optimized Triton code for performance-portable GPU kernels. This tutorial at PLDI 2026 covers Helion's architecture, autotuning, and CuteDSL backend.
Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, with up to 3.5x speedup when fused with rotary embeddings.
FPSan is a Triton compiler pass that enables verification of algebraic equivalence of floating-point programs by replacing floating-point operations with integer operations, relying on Schanuel's conjecture for correctness.
KernelBench-X is a new benchmark for evaluating LLM-generated GPU kernels, revealing that task structure impacts correctness more than method design and that correctness does not guarantee hardware efficiency.
Researchers from Carnegie Mellon, University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation that achieves 3.12× and 1.72× speedups on the KernelBench Level-2 and Level-3 benchmarks, respectively, through failure-driven adaptation and diversity-preserving search, without additional fine-tuning.