@pradheepraop: implemented the top-k kernel from the kernel design section in the msa paper. https://github.com/Mantissagithub/learn_c…

Summary

Implemented a top-k kernel from the kernel design section of the MSA paper, using exp-free comparison and warp-level tree merging with CUDA shuffles. The code is available on GitHub.


Cached at: 06/15/26, 01:04 PM

implemented the top-k kernel from the kernel design section in the msa paper.

https://github.com/Mantissagithub/learn_cuda/blob/msa/07_projects/msa/top_k.cu…

it comes with two ideas:

  • exp-free comparison: no need to compute softmax, since softmax is monotonic and preserves the ordering of the raw scores
  • each warp lane scans a strided 1/32 of the input (every 32nd element), keeps a small local top-k, and tree-merges the results with shuffles
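
The two ideas above can be sketched on the CPU. This is a minimal Python simulation, not the kernel from the repo: the names `warp_topk`, `merge_topk` and `softmax` are hypothetical, and the merge order is assumed to follow a `__shfl_down_sync`-style reduction (offsets 16, 8, 4, 2, 1), which may differ from what the actual `top_k.cu` does.

```python
import math

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def softmax(xs):
    """Reference softmax, used only to check the exp-free claim."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merge_topk(a, b, k):
    """Merge two (score, index) candidate lists and keep the k largest."""
    return sorted(a + b, reverse=True)[:k]


def warp_topk(scores, k):
    """Simulate one warp selecting the top-k indices from raw scores.

    No exp is computed: raw scores are compared directly.
    """
    # Phase 1: lane i scans elements i, i + 32, i + 64, ... and keeps
    # its own local top-k as (score, index) pairs.
    local = []
    for lane in range(WARP_SIZE):
        cand = [(scores[j], j) for j in range(lane, len(scores), WARP_SIZE)]
        local.append(sorted(cand, reverse=True)[:k])

    # Phase 2: tree merge. On the GPU each lane would read its partner's
    # candidates with a warp shuffle; here we snapshot the lists so every
    # lane sees the values from before this round (assumed merge order).
    offset = WARP_SIZE // 2
    while offset > 0:
        snapshot = list(local)
        for lane in range(WARP_SIZE):
            partner = lane + offset
            if partner < WARP_SIZE:
                local[lane] = merge_topk(snapshot[lane], snapshot[partner], k)
        offset //= 2

    # After log2(32) = 5 rounds, lane 0 holds the warp-wide top-k.
    return [idx for _, idx in local[0]]
```

Calling `warp_topk(logits, k)` returns the indices of the k largest raw scores in descending order. For scores without ties, these should match the indices obtained by ranking `softmax(logits)`, which is why the exp can be skipped.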

ended up revising cuda the whole night and also cleaned up my learn_cuda repo, so any feedback/optimizations are welcome.


Mantissagithub/learn_cuda

Source: https://github.com/Mantissagithub/learn_cuda

learn_cuda

Learning cuda (aka gpu programming)
