@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…

X AI KOLs Following 05/26/26, 03:22 PM Tools

cuda dsl flash-attention gpu-kernel linear-algebra transposition

Summary

A tweet showcasing a CuTe DSL kernel sample that uses layouts to express transposition, part of the FlashAttention-4 kernel.

^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe kernel, with a cute trick: using layouts to express transposition. https://modal.com/notebooks/modal-labs/examples/nb-owEUD0kdSVeL4KeEX5sjh1…

Original Article

Similar Articles

@charles_irl: The CuTe and CuTe DSL articles include minimal code snippets illustrating core principles and basic usage. These snippe…

X AI KOLs Following

The CuTe and CuTe DSL articles provide minimal code snippets with Modal Notebooks for hands-on learning.

@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

X AI KOLs Following

Discussion about rewriting parallelism to improve kernel performance using CuTe DSL and tile programming models for the FA4 (FlashAttention 4) kernel.

@charles_irl: New articles in the GPU Glossary for CuTe DSL, CUTLASS, and CuTe -- the tools used to write some of the highest-perform…

X AI KOLs Following

New articles in the GPU Glossary cover CuTe DSL, CUTLASS, and CuTe – tools for writing high-performance GPU kernels on data center GPUs, with examples in Python.

C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 — what should new GPU kernel / LLM inference engineers actually learn?[D]

Reddit r/MachineLearning

Discussion of the shift in GPU kernel engineering from C++ CuTe/CUTLASS to NVIDIA's Python-based CuTeDSL, questioning whether new engineers should learn legacy C++ templates or prioritize the emerging stack for LLM inference work.

@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…

X AI KOLs Following

A detailed blog post dissecting ThunderKittens, a compact DSL for high-performance AI kernels, including a bottom-up analysis of its abstractions and a benchmark implementing a non-causal attention prefill kernel that outperforms FlashAttention-2 by ~1.55x and matches FlashAttention-3.

Similar Articles

@charles_irl: The CuTe and CuTe DSL articles include minimal code snippets illustrating core principles and basic usage. These snippe…

@charles_irl: Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL. FA4 is a very…

@charles_irl: New articles in the GPU Glossary for CuTe DSL, CUTLASS, and CuTe -- the tools used to write some of the highest-perform…

C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 — what should new GPU kernel / LLM inference engineers actually learn?[D]

@hamzaelshafie: New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels" This post…

Submit Feedback