Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering
Summary
Flash-GMM introduces a fused Triton kernel for Gaussian Mixture Models that achieves a 20x speedup and enables training on datasets 100x larger on a single GPU, making soft clustering a viable drop-in replacement for k-means in approximate nearest-neighbor search.
Cached at: 06/12/26, 10:52 AM
Source: https://huggingface.co/papers/2606.10896
Abstract
Flash-GMM introduces an efficient fused Triton kernel for Gaussian Mixture Models that achieves significant speedup and enables processing much larger datasets on a single GPU.
We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20× speedup over existing implementations and enables training on datasets more than 100× larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for k-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to 1.7× fewer distance computations, or equivalently, yields +2 to 12 recall@10 at matched computational cost. We release the kernel as an open-source project.
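To make the memory argument concrete, here is a minimal NumPy sketch of an EM step that never stores the full N×K responsibility matrix. It is not the authors' Triton kernel: it only illustrates the general idea of computing responsibilities one block at a time and folding them straight into running sufficient statistics. The diagonal-covariance model, the function names (`em_step_chunked`, `multi_assign`), the chunk size, and the responsibility threshold are all assumptions made for illustration.

```python
import numpy as np


def _responsibilities(x, weights, means, variances):
    """Responsibilities for one block of rows (diagonal covariance assumed).

    Returns r with shape (B, K) and the per-row log-likelihood (B, 1).
    """
    d = x.shape[1]
    inv_var = 1.0 / variances                                  # (K, D)
    log_norm = -0.5 * (d * np.log(2.0 * np.pi)
                       + np.log(variances).sum(axis=1))        # (K,)
    # sum_d (x_d - mu_d)^2 / var_d, expanded so it becomes matrix products.
    maha = ((x ** 2) @ inv_var.T
            - 2.0 * x @ (means * inv_var).T
            + (means ** 2 * inv_var).sum(axis=1))              # (B, K)
    log_p = np.log(weights) + log_norm - 0.5 * maha
    # Numerically stable log-sum-exp over components.
    m = log_p.max(axis=1, keepdims=True)
    lse = m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))
    return np.exp(log_p - lse), lse


def em_step_chunked(X, weights, means, variances, chunk_size=1024):
    """One EM iteration; only a (chunk_size, K) block of r exists at a time."""
    n, d = X.shape
    k = means.shape[0]
    nk = np.zeros(k)           # running sum_i r_ik
    sx = np.zeros((k, d))      # running sum_i r_ik * x_i
    sxx = np.zeros((k, d))     # running sum_i r_ik * x_i^2
    log_lik = 0.0
    for start in range(0, n, chunk_size):
        x = X[start:start + chunk_size]
        r, lse = _responsibilities(x, weights, means, variances)
        nk += r.sum(axis=0)
        sx += r.T @ x
        sxx += r.T @ (x ** 2)
        log_lik += float(lse.sum())
        # r goes out of scope here; it is never written to an (N, K) buffer.
    nk_safe = np.maximum(nk, 1e-12)        # guard against empty components
    new_means = sx / nk_safe[:, None]
    # Small floor keeps variances strictly positive (illustrative value).
    new_vars = sxx / nk_safe[:, None] - new_means ** 2 + 1e-6
    return nk / n, new_means, new_vars, log_lik / n


def multi_assign(x, weights, means, variances, threshold=0.2):
    """Clusters per row with responsibility >= threshold, plus the argmax."""
    r, _ = _responsibilities(x, weights, means, variances)
    mask = r >= threshold
    mask[np.arange(len(x)), r.argmax(axis=1)] = True
    return [np.flatnonzero(row) for row in mask]
```

Peak extra memory in this sketch scales with `chunk_size * K` rather than `N * K`, which is the property the abstract attributes to the fused kernel. `multi_assign` shows one plausible way responsibilities could route a border vector to more than one IVF list; the actual assignment rule used in the paper is not given on this page.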
Similar Articles
@Andy_ShuoYang: Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for f…
The Flash-KMeans team releases FlashLib, a GPU library for classical ML operators that achieves up to 208x speedups over cuML on Hopper GPUs, with a focus on fast, predictable performance for agentic AI workloads.
@neural_avb: Deep learning bros and sisters, don't sleep on this. You can cluster millions of documents in embedding space, mass-ann…
Shuo Yang and team release FlashLib, a GPU library that accelerates classical ML operators like KMeans, KNN, HDBSCAN, PCA, and t-SNE, claiming speedups up to 208x.
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization
TideGS introduces an out-of-core training framework that enables 3D Gaussian Splatting with over one billion primitives on a single GPU by managing parameters across SSD-CPU-GPU hierarchy via block-virtualization, asynchronous pipeline, and differential streaming techniques.
@Kimi_Moonshot: We're open-sourcing FlashKDA — our high-performance CUTLASS-based implementation of Kimi Delta Attention kernels. Achie…
Moonshot AI releases FlashKDA, an open-source CUTLASS-based implementation of Kimi Delta Attention kernels that delivers 1.72×–2.22× prefill speedup on H20 GPUs.
ZipSplat: Fewer Gaussians, Better Splats
ZipSplat is a token-based feed-forward 3D Gaussian Splatting model that uses k-means clustering to decouple Gaussian placement from the pixel grid, achieving ~6x fewer Gaussians while setting new state-of-the-art results on DL3DV and RealEstate10K without requiring ground-truth poses or intrinsics.