@pradheepraop: implemented the top-k kernel from the kernel design section in the msa paper. https://github.com/Mantissagithub/learn_c…
Summary
Implemented a top-k kernel from the kernel design section of the MSA paper, using exp-free comparison and warp-level tree merging with CUDA shuffles. The code is available on GitHub.
Cached at: 06/15/26, 01:04 PM
implemented the top-k kernel from the kernel design section in the msa paper.
https://github.com/Mantissagithub/learn_cuda/blob/msa/07_projects/msa/top_k.cu…
it comes with two ideas:
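- exp-free comparison: no need to compute softmax, since softmax is monotonic within a row, so the raw scores already have the same ordering (and the same top-k)
- each of the 32 warp lanes scans a strided 1/32 slice of the scores, keeps a small local top-k, and the per-lane results are tree-merged with warp shuffles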
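A rough CPU-side sketch of the two ideas in Python, not the kernel from the repo. It assumes lane `i` reads elements `i, i+32, i+64, ...` and that the merge follows an XOR butterfly of the kind `__shfl_xor_sync` would give on the GPU; the actual kernel may use a different shuffle pattern. The names `warp_top_k`, `merge_top_k`, and `softmax` are hypothetical helpers for illustration.

```python
import math
import random

WARP_SIZE = 32  # lanes per warp on NVIDIA GPUs


def softmax(row):
    """Numerically stable softmax, used only to check the exp-free claim."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def merge_top_k(a, b, k):
    """Merge two lists of (score, index) pairs and keep the k largest."""
    return sorted(a + b, reverse=True)[:k]


def warp_top_k(scores, k):
    """Emulate one warp: strided per-lane scan, then a butterfly merge.

    Assumption: this mirrors the structure described in the post, not the
    author's exact code. Comparisons use raw scores only (no exp).
    """
    # Phase 1: each lane scans its strided slice and keeps a local top-k.
    local = []
    for lane in range(WARP_SIZE):
        mine = [(scores[j], j) for j in range(lane, len(scores), WARP_SIZE)]
        local.append(sorted(mine, reverse=True)[:k])

    # Phase 2: log2(32) = 5 merge rounds with partner lane = lane XOR offset.
    # On the GPU the partner's list would arrive through warp shuffles.
    offset = WARP_SIZE // 2
    while offset >= 1:
        local = [
            merge_top_k(local[lane], local[lane ^ offset], k)
            for lane in range(WARP_SIZE)
        ]
        offset //= 2

    # After the last round every lane holds the warp-wide top-k.
    return [idx for _, idx in local[0]]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [rng.uniform(-4.0, 4.0) for _ in range(1000)]
    print(warp_top_k(scores, 8))
```

Each round merges two disjoint groups of lanes, so no index shows up twice, and because only raw scores are compared the result matches a top-k taken over the softmax output whenever the scores are distinct.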
ended up revising cuda the whole night and also cleaned up my learn_cuda repo, so any feedback/optimizations are welcome.
Mantissagithub/learn_cuda
Source: https://github.com/Mantissagithub/learn_cuda
learn_cuda
Learning cuda (aka gpu programming)