@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…
Summary
A tweet showcasing a CuTe DSL kernel sample that uses layouts to express transposition, part of the FlashAttention-4 kernel.
Similar Articles
@charles_irl: The CuTe and CuTe DSL articles include minimal code snippets illustrating core principles and basic usage. These snippe…
The CuTe and CuTe DSL articles provide minimal code snippets with Modal Notebooks for hands-on learning.
@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…
Discussion about rewriting parallelism to improve kernel performance using CuTe DSL and tile programming models for the FA4 (FlashAttention 4) kernel.
@charles_irl: New articles in the GPU Glossary for CuTe DSL, CUTLASS, and CuTe -- the tools used to write some of the highest-perform…
New articles in the GPU Glossary cover CuTe DSL, CUTLASS, and CuTe – tools for writing high-performance GPU kernels on data center GPUs, with examples in Python.
C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 — what should new GPU kernel / LLM inference engineers actually learn?[D]
Discussion of the shift in GPU kernel engineering from C++ CuTe/CUTLASS to NVIDIA's Python-based CuTeDSL, questioning whether new engineers should learn legacy C++ templates or prioritize the emerging stack for LLM inference work.
@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…
A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.