@pradheepraop: implemented the top-k kernel from the kernel design section in the msa paper. https://github.com/Mantissagithub/learn_c…

X AI KOLs Timeline 06/15/26, 07:44 AM Tools

top-k kernel-design cuda gpu-programming attention open-source

Summary

Implemented a top-k kernel from the kernel design section of the MSA paper, using exp-free comparison and warp-level tree merging with CUDA shuffles. The code is available on GitHub.

Original Article

Cached at: 06/15/26, 01:04 PM

implemented the top-k kernel from the kernel design section in the msa paper.

https://github.com/Mantissagithub/learn_cuda/blob/msa/07_projects/msa/top_k.cu…

it comes with two ideas:

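- exp-free comparison: no need to compute softmax, since softmax is monotonic and preserves ordering, so the top-k raw scores are also the top-k probabilities
- each warp lane scans every 32nd element (a stride of 32, so 1/32 of the input per lane), keeps a small local top-k, and the lanes tree-merge their results with shuffles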
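
The two ideas can be sketched on the host in plain Python. This is only a simulation under stated assumptions, not the code from the linked `top_k.cu`: the names `warp_topk_sim`, `merge_topk`, and `softmax` are hypothetical, the 32 "lanes" are ordinary list slots, and the tree merge imitates what a shuffle-down reduction (for example with `__shfl_down_sync`) would do on a GPU. Which shuffle variant the actual kernel uses is not stated in the post.

```python
import math

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def merge_topk(a, b, k):
    """Merge two lists of (score, index) pairs and keep the k largest scores."""
    # Ties are broken by the smaller index so the result is deterministic.
    return sorted(a + b, key=lambda p: (-p[0], p[1]))[:k]


def warp_topk_sim(scores, k):
    """Simulate one warp picking the top-k raw scores (no exp, no softmax)."""
    # Phase 1: lane l visits indices l, l + 32, l + 64, ... and keeps a local top-k.
    lanes = []
    for lane in range(WARP_SIZE):
        local = [(scores[i], i) for i in range(lane, len(scores), WARP_SIZE)]
        lanes.append(merge_topk(local, [], k))

    # Phase 2: tree merge. At each step, lane l pulls the list held by
    # lane l + offset (the role a shuffle-down would play) and merges it.
    offset = WARP_SIZE // 2
    while offset >= 1:
        for lane in range(offset):
            lanes[lane] = merge_topk(lanes[lane], lanes[lane + offset], k)
        offset //= 2
    return lanes[0]  # lane 0 ends up holding the warp-wide top-k


def softmax(xs):
    """Reference softmax, used only to check that the ordering is preserved."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # 101 distinct scores in a shuffled order (37 and 101 are coprime).
    demo = [((i * 37) % 101) * 0.1 for i in range(101)]
    print([idx for _, idx in warp_topk_sim(demo, 5)])
```

Because each lane's local top-k already contains every candidate from its own slice that could reach the global top-k, merging the 32 short lists in log2(32) = 5 steps gives the same result as sorting the whole input. Comparing raw scores instead of softmax outputs is safe for the same reason the post gives: softmax is strictly increasing in each score, so it never reorders them.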

ended up revising cuda the whole night and also cleaned up my learn_cuda repo, so any feedback/optimizations are welcome.

Mantissagithub/learn_cuda

Source: https://github.com/Mantissagithub/learn_cuda

learn_cuda

Learning cuda (aka gpu programming)
