triton

#triton

Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

Hugging Face Blog · 3d ago

This post continues the Profiling in PyTorch series, covering nn.Linear, MLP blocks, and fusion with Triton kernels to improve performance.

@TheAhmadOsman: How to go about learning all of this? 1st: Start with the serving engine view - vLLM: PagedAttention, continuous batchi…

X AI KOLs Following · 5d ago

A detailed guide on learning AI inference engine internals, covering serving engines like vLLM and SGLang, low-level GPU kernel programming with Triton and CUTLASS, and a sequence of mini-projects to build hands-on expertise.

@PyTorch: More details about the tutorial https://pldi26.sigplan.org/details/pldi-2026-tutorials/1/Writing-Performance-Portable-K…

X AI KOLs Following · 2026-06-04

Helion is a Python DSL that compiles to optimized Triton code for performance-portable GPU kernels. This tutorial at PLDI 2026 covers Helion's architecture, autotuning, and CuteDSL backend.

@PyTorch: PyTorch member Meta just open-sourced a GPU kernel that makes attention 2.3x faster on NVIDIA Blackwell. TLX Block Atte…

X AI KOLs Following · 2026-05-26

Meta open-sources TLX Block Attention, a warp-specialized Triton kernel that achieves a 2.3x speedup for block-diagonal self-attention on NVIDIA Blackwell GPUs, and up to a 3.5x speedup when fused with rotary embeddings.

Schanuel's Conjecture and the Semantics of Triton's FPSan

Hacker News Top · 2026-05-16

FPSan is a Triton compiler pass that enables verification of algebraic equivalence of floating-point programs by replacing floating-point operations with integer operations, relying on Schanuel's conjecture for correctness.

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Hugging Face Daily Papers · 2026-05-06

KernelBench-X is a new benchmark for evaluating LLM-generated GPU kernels, revealing that task structure impacts correctness more than method design and that correctness does not guarantee hardware efficiency.

AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation

arXiv cs.CL · 2026-04-21

Researchers from Carnegie Mellon, University of Washington, and Arm propose AdaExplore, an LLM agent framework for GPU kernel code generation that achieves 3.12× and 1.72× speedups on KernelBench Level-2 and Level-3 benchmarks through failure-driven adaptation and diversity-preserving search, without additional fine-tuning.
