@charles_irl: New articles in the GPU Glossary for CuTe DSL, CUTLASS, and CuTe -- the tools used to write some of the highest-perform…

X AI KOLs Following 05/26/26, 03:22 PM News

gpu-glossary cute-dsl cutlass cute gpu-programming kernel high-performance

Summary

New articles in the GPU Glossary cover CuTe DSL, CUTLASS, and CuTe – tools for writing high-performance GPU kernels on data center GPUs, with examples in Python.

New articles in the GPU Glossary for CuTe DSL, CUTLASS, and CuTe -- the tools used to write some of the highest-performance kernels on contemporary data center GPUs. https://modal.com/gpu-glossary/host-software/cute-dsl…


Cached at: 05/26/26, 06:56 PM

What is CuTe DSL? | GPU Glossary

Source: https://modal.com/gpu-glossary/host-software/cute-dsl

CuTe DSL is a Python-based Domain-Specific Language (DSL) for writing and dynamically compiling kernels at high performance and with high developer productivity.

CuTe DSL is part of CUTLASS, a collection of CUDA C++ templates and DSLs. Unlike cuBLAS or cuDNN, which provide ready-to-call kernels for common operations, the CUTLASS stack provides tools for composably defining high-performance kernels.

The core abstractions of CuTe DSL include layouts, tensors, hardware atoms, and tiled operations. Layouts describe how data is organized in memory and across threads. Tensors combine data pointers or iterators with layout metadata. Atoms represent fundamental hardware operations such as matrix multiply-accumulate (MMA) or memory copy. Tiled operations describe how atoms are applied across thread blocks and warps. For the underlying details, see CuTe.
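To make the layout idea concrete, here is a minimal pure-Python sketch of a layout as a shape plus one stride per dimension, mapping a logical coordinate to a linear memory offset. The `Layout` class and its `offset` method are hypothetical names chosen for illustration; they are not the CuTe DSL API, and real CuTe layouts are more general than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Hypothetical illustration: a shape plus one stride per dimension."""

    shape: tuple[int, ...]
    stride: tuple[int, ...]

    def offset(self, coord: tuple[int, ...]) -> int:
        # Check bounds, then take the inner product of coordinate and stride.
        assert len(coord) == len(self.shape)
        assert all(0 <= c < s for c, s in zip(coord, self.shape))
        return sum(c * d for c, d in zip(coord, self.stride))


# The same 4x8 logical shape with two different memory orders.
row_major = Layout(shape=(4, 8), stride=(8, 1))
col_major = Layout(shape=(4, 8), stride=(1, 4))

print(row_major.offset((2, 3)))  # 2*8 + 3*1 = 19
print(col_major.offset((2, 3)))  # 2*1 + 3*4 = 14
```

The logical coordinate `(2, 3)` is the same in both cases; only the layout, and therefore the memory offset, changes. Keeping the layout separate from the indexing code is what makes it possible to change one without rewriting the other.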

When launching a CuTe DSL kernel from Python, the Python program calls a `@cute.jit` function, and that function launches a `@cute.kernel` function.

The `@cute.jit` decorator declares a JIT-compiled function that can be called from Python or from other CuTe DSL functions. The `@cute.kernel` decorator defines a GPU kernel function that can be launched from a `@cute.jit` function. Python code cannot call a `@cute.kernel` function directly.

For example, let’s look at a naive (unoptimized) CuTe DSL kernel for elementwise addition of two one-dimensional tensors -- the “hello world” for GPU programming that goes back to Ian Buck’s Brook framework that preceded and inspired CUDA. You can edit this kernel and execute it on a B200 GPU using this Modal Notebook.

```python
import cutlass.cute as cute
import torch

# Type alias: the entry point is annotated to accept either tensor type.
Tensor = cute.Tensor | torch.Tensor

@cute.kernel
def elem_add_kernel(a: cute.Tensor, b: cute.Tensor, out: cute.Tensor):
    block_x, _, _ = cute.arch.block_idx()
    block_dim_x, _, _ = cute.arch.block_dim()
    thread_x, _, _ = cute.arch.thread_idx()

    # Global element index handled by this thread.
    i = block_x * block_dim_x + thread_x

    # Guard threads in the last block that fall past the end of the tensor.
    if i < out.shape[0]:
        out[i] = a[i] + b[i]

@cute.jit
def elem_add(a: Tensor, b: Tensor, out: Tensor):
    n = out.shape[0]
    threads_per_block = 128
    # Ceiling division: enough blocks to cover all n elements.
    blocks = (n + threads_per_block - 1) // threads_per_block

    elem_add_kernel(a, b, out).launch(
        grid=(blocks, 1, 1),
        block=(threads_per_block, 1, 1),
    )
```

The `elem_add_kernel` function is the kernel. Each thread computes one output element. The global element index `i` is computed from the thread block index, the number of threads in the block, and the thread index inside the block:

```python
i = block_x * block_dim_x + thread_x
```

The `elem_add` function computes the number of thread blocks needed to cover the output tensor and launches the kernel with a one-dimensional thread block grid.
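The launch arithmetic can be checked without a GPU. The following is a pure-Python sketch that replays the same index computation and bounds guard sequentially on the CPU; `launch_config` and `simulate_elem_add` are hypothetical helper names for illustration, not part of CuTe DSL.

```python
def launch_config(n: int, threads_per_block: int = 128) -> tuple[int, int]:
    """Return (blocks, threads_per_block) using the same ceiling division as elem_add."""
    blocks = (n + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def simulate_elem_add(a: list[float], b: list[float]) -> list[float]:
    """Run the kernel body once per (block, thread) pair, sequentially on the CPU."""
    n = len(a)
    out = [0.0] * n
    blocks, block_dim_x = launch_config(n)
    for block_x in range(blocks):
        for thread_x in range(block_dim_x):
            i = block_x * block_dim_x + thread_x
            if i < n:  # threads past the end of the tensor do nothing
                out[i] = a[i] + b[i]
    return out


n = 1000
blocks, threads = launch_config(n)
print(blocks, blocks * threads - n)  # 8 blocks, 24 idle threads in the last block

a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
assert simulate_elem_add(a, b) == [3.0 * i for i in range(n)]
```

With 1000 elements and 128 threads per block, the grid rounds up to 8 blocks (1024 threads), so the last 24 threads fail the `i < n` check. Without that guard they would read and write out of bounds.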

This example is pedagogical, not optimized. Even so, it shows a good basic access pattern: adjacent threads read adjacent elements of `a` and `b`, then write adjacent elements of `out`. That is the pattern needed for coalesced accesses to global memory; see memory coalescing.
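A rough way to see why adjacency matters is to count how many memory segments one warp touches for a given access stride. The sketch below is a simplified model in pure Python: the 128-byte segment size is an assumption made for illustration (actual transaction granularity depends on the architecture), and `segments_touched` is a hypothetical helper.

```python
WARP_SIZE = 32       # threads per warp on NVIDIA GPUs
ELEM_BYTES = 4       # one float32
SEGMENT_BYTES = 128  # assumed granularity of a coalesced global memory access


def segments_touched(stride: int, base: int = 0) -> int:
    """Count distinct segments hit when thread t reads element base + t * stride."""
    addresses = [(base + t * stride) * ELEM_BYTES for t in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


print(segments_touched(stride=1))   # 1
print(segments_touched(stride=2))   # 2
print(segments_touched(stride=32))  # 32
```

In the kernel above, consecutive threads use consecutive values of `i`, which corresponds to `stride=1`: in this model the whole warp is served from a single segment. At a stride of 32 elements, every thread lands in a different segment, so the same 32 values cost 32 segments of traffic.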

Layout concerns are one reason why CuTe DSL is useful for high-performance kernels. Engineering for performance is difficult because kernels must be closely mapped to hardware: which threads handle which data, how memory is accessed, how work is tiled, and which hardware operations the generated code should use. CuTe DSL allows programmers to express these mappings explicitly while reusing much of the same kernel code across a variety of shapes and Streaming Multiprocessor architectures.

This may be surprising to performance-focused engineers from other domains -- how can a program written in an interpreted language like Python hope to compete with programs written in compiled languages?

The answer is that CuTe DSL kernels are compiled Just-In-Time (JIT). Python source code is converted to an abstract syntax tree (AST), traced with proxy arguments, and then compiled. Note that only a subset of Python semantics is supported in JIT-compiled code.
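The parse-then-trace idea can be illustrated with a toy in plain Python. This is not how CuTe DSL is implemented; it is a minimal sketch, with hypothetical names (`Proxy`, `toy_trace`) and a made-up textual IR, of how calling a function with proxy arguments records its operations instead of executing them.

```python
import ast


class Proxy:
    """Stands in for a runtime value and records operations instead of computing them."""

    def __init__(self, name: str, trace: list[str]):
        self.name = name
        self.trace = trace

    def _binary(self, op: str, other: object) -> "Proxy":
        rhs = other.name if isinstance(other, Proxy) else repr(other)
        result = Proxy(f"%{len(self.trace)}", self.trace)
        self.trace.append(f"{result.name} = {op} {self.name}, {rhs}")
        return result

    def __add__(self, other: object) -> "Proxy":
        return self._binary("add", other)

    def __mul__(self, other: object) -> "Proxy":
        return self._binary("mul", other)


def toy_trace(source: str) -> list[str]:
    """Parse source into an AST, then call the function it defines with proxies."""
    tree = ast.parse(source)  # step 1: source text -> AST
    func_def = tree.body[0]
    assert isinstance(func_def, ast.FunctionDef)
    arg_names = [arg.arg for arg in func_def.args.args]

    namespace: dict = {}
    exec(compile(tree, "<toy_jit>", "exec"), namespace)
    fn = namespace[func_def.name]

    trace: list[str] = []
    result = fn(*(Proxy(f"%{name}", trace) for name in arg_names))  # step 2: trace
    trace.append(f"return {result.name}")
    return trace  # step 3 (not shown): lower the recorded operations to machine code


SOURCE = """
def axpy(a, x, y):
    return a * x + y
"""

for line in toy_trace(SOURCE):
    print(line)
# %0 = mul %a, %x
# %1 = add %0, %y
# return %1
```

In this toy, the proxies hold no values, so ordinary Python control flow that depends on them (such as `if a > 0:`) cannot be recorded. That gives some intuition for why a traced DSL supports only a subset of Python semantics.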

At the time of writing, in CUTLASS 4.x, the compilation stack passes through Multi-Level Intermediate Representation (MLIR) to the PTX IR and then to device-specific SASS before being executed.

Consider the FlashAttention-4 kernels. Our writeup of the open source code walks through how it uses pipelined warp specialization, Tensor Core operations, and Tensor Memory and Tensor Memory Accelerator operations to achieve state-of-the-art performance directly from CuTe DSL.

For more details on CuTe DSL, see NVIDIA’s CuTe DSL documentation and CuTe DSL overview blog.

Similar Articles

C++ CuTe / CUTLASS vs CuTeDSL (Python) in 2026 — what should new GPU kernel / LLM inference engineers actually learn?[D]

@charles_irl: The CuTe and CuTe DSL articles include minimal code snippets illustrating core principles and basic usage. These snippe…

@pauliusztin_: I just found one of the most useful resources for understanding GPUs. No more jumping between random docs, PDFs, and fo…

@gpuwaster: 61/100 of GPU Grind going through this GTC 2020 lecture: Developing CUDA kernels to push Tensor Cores to the Absolute L…

@charles_irl: ^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe ker…
