@PyTorch: More details about the tutorial https://pldi26.sigplan.org/details/pldi-2026-tutorials/1/Writing-Performance-Portable-K…
Summary
Helion is a Python DSL that compiles to optimized Triton code for performance-portable GPU kernels. This tutorial at PLDI 2026 covers Helion's architecture, autotuning, and CuteDSL backend.
More details about the tutorial https://pldi26.sigplan.org/details/pldi-2026-tutorials/1/Writing-Performance-Portable-Kernels-Simplified-with-Helion…
Writing Performance-Portable Kernels Simplified with Helion (PLDI 2026 - PLDI Tutorials) - PLDI 2026
This program is tentative and subject to change.
Abstract
Modern machine learning relies heavily on custom kernels for performance, which are often written in hardware-specific languages and create technical debt. Helion addresses this by compiling a high-level Python Domain Specific Language (DSL) into optimized Triton code, automating low-level details and hardware-specific tuning. With its PyTorch-like syntax and autotuning engine, Helion delivers fast, portable performance while significantly reducing development effort. Helion is open source at https://github.com/pytorch/helion.
This 3-hour tutorial will describe Helion through a series of talks, demonstrations, and hands-on experiments.
- Introduction to Helion (35 mins): We will provide an overview of Helion, including its underlying motivation, programming model, overall design architecture, and various use cases.
- Compiler Architecture and Integration with TorchInductor (35 mins): The Helion compiler architecture progressively lowers Python functions into highly optimized Triton code, utilizing TorchInductor as its backend. The key stages of this compilation pipeline are Python AST parsing, Type Propagation, Device IR lowering, a series of compiler passes, and finally, code generation. We will detail the integration between Helion and TorchInductor, explaining how this interface enables Helion to target both GPU and non-GPU hardware and how users can incorporate their own custom backends.
- Break (30 mins): Time to set up compute for the hands-on experiments in the following session.
- Autotuning in Helion (50 mins): A key feature of Helion is its scalable autotuning framework that explores a vast configuration space, where one Helion kernel can map to thousands of Triton kernels. In this session, we detail the configuration space that Helion explores, illustrate how different configs map to Triton code, and examine the various search strategies that Helion utilizes, such as Likelihood-Free Bayesian Optimization and LLM-guided autotuning. Attendees will also have the opportunity to gain hands-on experience with autotuning Helion kernels.
- CuteDSL backend for SOTA NVIDIA performance (30 mins): In this session, we will present the cutting-edge performance we are achieving on NVIDIA GPUs, driven by our ongoing efforts to build the CuteDSL backend in Helion. We will also showcase the agentic development workflow that facilitates these advancements.
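The autotuning session's point that one kernel can map to a very large number of generated variants can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: the knob names (`block_m`, `block_n`, `num_warps`, `num_stages`), their values, and the `synthetic_cost` function are hypothetical and are not Helion's configuration schema or API. The search shown is plain random sampling, not the Likelihood-Free Bayesian Optimization or LLM-guided strategies mentioned above, and no kernel is compiled or timed.

```python
import itertools
import random

# Hypothetical tuning knobs; these names and values are illustrative
# assumptions, not Helion's actual configuration schema.
SEARCH_SPACE = {
    "block_m": [16, 32, 64, 128],
    "block_n": [16, 32, 64, 128],
    "num_warps": [1, 2, 4, 8],
    "num_stages": [1, 2, 3, 4],
}


def enumerate_configs(space):
    """Return every combination of knob values as a list of dicts."""
    names = list(space)
    return [dict(zip(names, values)) for values in itertools.product(*space.values())]


def synthetic_cost(config):
    """Stand-in for timing a generated kernel (lower is better).

    A real autotuner would compile each variant and benchmark it on the
    target device; this made-up formula only gives the search something
    to minimize.
    """
    tile_penalty = abs(config["block_m"] * config["block_n"] - 64 * 64) / 1024
    warp_penalty = abs(config["num_warps"] - 4)
    stage_penalty = abs(config["num_stages"] - 3)
    return 1.0 + tile_penalty + warp_penalty + stage_penalty


def random_search(configs, cost_fn, budget, seed=0):
    """Evaluate `budget` randomly sampled configs and return the cheapest."""
    rng = random.Random(seed)
    sampled = rng.sample(configs, min(budget, len(configs)))
    return min(sampled, key=cost_fn)


configs = enumerate_configs(SEARCH_SPACE)
best_sampled = random_search(configs, synthetic_cost, budget=32)
best_overall = min(configs, key=synthetic_cost)
```

Four knobs with four values each already yield 256 candidate configurations, so the space grows multiplicatively with every added knob. Comparing `best_sampled` with `best_overall` shows what a fixed evaluation budget can miss, which is the usual motivation for model-guided search strategies over exhaustive or purely random ones.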