@Andy_ShuoYang: Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for f…

X AI KOLs Following Tools

Summary

The Flash-KMeans team releases FlashLib, a GPU library for classical ML operators that achieves up to 208x speedups over cuML on Hopper GPUs, with a focus on fast, predictable performance for agentic AI workloads.

Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators. Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML). Blog: https://flashml-org.github.io Code: https://github.com/FlashML-org/flashlib…
Original Article
View Cached Full Text

Cached at: 05/27/26, 03:00 AM

Flash-KMeans was only the beginning.

Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for fast, predictable, agent-ready classical ML operators.

Up to 26× on KMeans, 19× on KNN, 40× on HDBSCAN, 208× on TruncatedSVD, 47× on PCA, 147× on exact t-SNE, and 49× on MultinomialNB over state-of-the-art (cuML).

Blog: https://flashml-org.github.io Code: https://github.com/FlashML-org/flashlib…


FlashLib: Bringing Flash Magic to Classical Machine Learning Operators

Source: https://flashml-org.github.io/ Shuo Yang1, Haocheng Xi1, Yilong Zhao1, Qiuyang Mang1, Zhe Wang2, Shanlin Sun2, Kurt Keutzer1, Joseph E. Gonzalez1, Song Han3, Chenfeng Xu4,*, Ion Stoica1,*

Code:github.com/FlashML-org/flashlib

26×KMeans

19×KNN

208×TruncatedSVD

47×PCA

7×UMAP

40×HDBSCAN

147×t-SNE (exact)

49×MultinomialNB

IntroducingFlashLib— a GPU library for classical machine learning operators on modern hardwares, rebuilt for today’s ML workloads and emerging agentic AI systems. Here are a few headline results from the first release:

  • Significant wins over cuML on Hopper GPUs: up to**26×**on KMeans,**19×**on KNN,**40×**on HDBSCAN,**208×**on TruncatedSVD,**47×**on PCA,**147×on exact t-SNE, and49×**on MultinomialNB.
  • Flash informative API: Predict runtime, memory footprint, and overhead for any workload in**~5 µs on pure CPU**, with no GPU profiling required.
  • Fast cold start, built to scale: FlashLib uses heuristic kernel selection to avoid long autotune loops, and already supports multi-GPU execution for large workloads.
  • Toward optimal hardware utilization: FlashLib drives kernels much closer to the limits of modern GPUs, with Flash-KMeans reaching up to61% of peak FLOPsand Flash-KNN reaching up to85.2% of peak HBM bandwidthon H200.

The next frontier of AI efficiency is not just faster LLM inference. It is faster intelligence assembly. For the past few years, MLsys work largely followed a model-centric view of intelligence. As LLMs became stronger through better reasoning, larger-scale test-time compute, and more capable inference, the systems community focused on making the transformer core faster: FlashAttention, FlashDecoding, KV-cache management, and LLM serving systems etc.

But the rise of agentic AI is changing the bottleneck. Modern intelligence is increasingly built around the base model through tools, harnesses, retrieval, verification, search, and orchestration. The LLM is no longer merely a standalone reasoner; it becomes a controller over a broader computational system. As a result, the performance bottleneck is no longer confined to transformer inference. It extends to the entire computational substrate surrounding the model. For example, in Agentic AI for Science, LLM agents may generate hypotheses or candidate solutions, but the surrounding loop often depends on search, clustering, nearest-neighbor retrieval, PCA, SVD, and other classical ML operators for verification and feedback. In multimodal generation and physical AI, models must increasingly process, compress, retrieve, and reorganize streaming features on the fly before they enter the model. These examples point to a broader shift: classical ML operators are becoming core primitives around the LLM model. We envision future agentic workflows where clustering, retrieval, dimensionality reduction, verification, and linear algebra are no longer offline utilities, but online primitives in the critical path of intelligence assembly. Figure 1 illustrates this shift.

Five classical ML operators migrating from batch latency tier into millisecond serving tier over a decade, with refined labelsSame latency chart as before, with two label refinements: Video generation is now Streaming video generation, and PCA-based KV compression is shortened to PCA-based compression.K-meansk-NNTruncatedSVDPCAHDBSCAN1 ms10 ms100 ms1 s1 min1 hr20152018202120242027Year operator entered this latency tierUser segmentationFeature reductionTopic modelingDoc clusteringItem-item recsysPipeline PCAEmbedding compressTopic discoveryRAG retrievalSemantic cachePCA-based compressionSVD-based compressionStreaming video generationKV cache clusteringAgent tool routing Figure 1.The latency budget for classical ML operators (KMeans, k-NN, TruncatedSVD, PCA, HDBSCAN) has been falling steadily over the past decade, on a log scale. The same primitives that used to run offline at the minute-to-hour tier (user segmentation, topic modeling, batch feature reduction) are now being called inside online serving paths (RAG retrieval, semantic cache, KV-cache clustering, agent tool routing) where the budget is measured in milliseconds. As this trend continues, the systems community needs implementations of these operators that are fast, hardware-efficient, reliable, and numerically faithful enough to sit in the critical path.*Hover (or tap) any point to see the specific work it represents.*However, the underlying implementations of these classical operators have not kept pace with this shift. Their core design assumptions still come from the pre-FlashAttention, pre-Hopper, pre-agent era, which creates a four-way mismatch. First, many operators carry natural implementations that are unfriendly to GPUs. Second, many libraries ship one static kernel implementation across all workloads and hardware tiers, leaving modern GPU hardware features unexploited. Third, many libraries are unaware of the user’s precision needs: they expose no way to declare a precision budget, leaving users unable to ask for the fastest algorithm that meets their tolerance. Fourth, the performance is black box: costly to profile, hard to modify, and impossible to budget without first reading the codebase, which leaves both developers and LLM-based agents in the dark.

FlashLib is our attempt to close these gaps and accelerate this emerging substrate, making it fast enough to sit inside the loop of agentic AI. It transforms classical ML operators from slow, offline utilities into fast, online ML primitives. Moreover, FlashLib exposes flash-informative APIs that reveal the cost, tolerance, and execution behavior of these primitives to higher-level agentic pipelines, thus enabling better scheduling and orchestration. We would also like to point out that while FlashLib is motivated by the emerging needs of LLM-centric and agentic AI systems, we also recognize that classical ML algorithms remain widely used across today’s machine learning stacks. Beyond generative AI, operators such as KMeans, KNN, PCA, SVD, t-SNE, and HDBSCAN etc are still core building blocks for recommendation systems, retrieval pipelines, scientific computing, anomaly detection, visualization, and preprocessing for downstream ML models. FlashLib provides a fast, easy-to-use, and adaptive software stack that covers these diverse applications with plug-and-play GPU acceleration.

We built FlashLib around four design principles. First, we reshape the algorithm to fit the hardware while achieving mathematical equivalence. Second, we build kernel variants per operator to fully exploit different workloads on different hardwares using modern hardware features. Third, we let users declare a precision budget and route to the fastest algorithm that meets it. Fourth, we keep the entire library transparent enough that users and LLM agents can easily read, compose, and modify the kernels.

01 / 04Reformulation

Mathematically Equivalent Reformulation: Rewriting Operators to Be GPU-Friendly

Many classical ML operators have natural implementations that are unfriendly to GPUs: they materialize large intermediates in HBM, introduce atomic contention, or run reductions along dimensions that don’t tile well. FlashLib’s first principle is to rewrite these into mathematically equivalent forms that are friendly to modern accelerators. KMeans assign is the clearest example: the natural implementation forms anN×Kdistance matrix in HBM and runs anargminper row, but the streaming-fused version keeps the running local minima in registers and never materializes the matrix. The same pattern recurs across the library: KNN’s fused top-K skips the‖x‖2term in‖xy‖2= ‖x‖2+ ‖y‖2− 2⟨x,y⟩, PCA’s dual-Gram routing picks the smaller ofXX(D×D)andX X⊤(N×N), avoiding the wastedO(max(N,D)3)eighthat cuML’s fixedD×Dpath runs on wide data, MultinomialNB changes atomic scatter for segment-level reduction, and t-SNE’s gradient never materializes theN×NQ matrix.

02 / 04Hardware-Aware Kernels

Hardware-Aware Implementation: Kernel Variants for Different Hardware and Workloads

To map these mathematical formulations directly to silicon, FlashLib builds multiple kernel variants that adapt to both the hardware and the workload. Flash-KNN illustrates this approach. First, at the backend layer, we ship a portable Triton implementation for both Ampere and Hopper. For Hopper, an opt-in CuteDSL FA3 backend additionally unlocks modern features like TMA fetching and warp-specialized pipelines. Second, at the kernel layer, the design adapts to the workload. For large queries, the kernel mirrors standard FlashAttention to maximize TensorCore utilization. For small queries against a huge corpus, it mirrors Flash-Decoding: a split-k layout distributes work across the corpus dimension to prevent SM idling. Third, at the heuristic layer, we choose hyperparameters like tile sizes and warp counts based on hardware characteristics such as cache size and register capacity. As a result, even aQ=1query against a 100M-vector corpus holds the kernel at85.2% of the H200’s peak HBM bandwidth.

03 / 04Tolerance Routing

Tolerance-Driven Dispatch: Routing to the Fastest Algorithm within a Precision Budget

FlashLib also exposes the speed-accuracy tradeoff as a user choice. Classical scientific computing often demands high precision in FP32 or even FP64 — for solving PDEs, certifying numerical methods, or anywhere a small error cascades into a wrong answer. Many AI workloads have no such requirement: a clustering pass over embedding vectors, a top-K retrieval, or a regression on noisy data can absorb a small declared residual for a substantial speedup. FlashLib makes that distinction the user’s to draw, through a single per-call argument,tol. Attol=None, reductions stay in exact precision and the call rides the kernel-fusion wins from above. Attol \> 0, the dispatcher routes through a Pareto frontier of precision emulation (fused variants like3xbf16and Ozaki-II INT8) and algorithm substitution (Halko subspace iteration), picking whichever has the highest throughput within the declared residual.

# GEMM: same call, different tol -> different variant.
flashlib.gemm(A, B)               # exact fp32
flashlib.gemm(A, B, tol=1e-3)     # bf16
flashlib.gemm(A, B, tol=1e-5)     # 3xbf16 (cute-fused)
flashlib.gemm(A, B, tol=1e-7)     # ozaki2_cute(s=8): tighter AND faster
flashlib.gemm(A, B, tol=1e-12)    # ozaki2_int8(s=14): FP64-grade

# PCA: tol unlocks an algorithm substitution, not just precision.
flashlib.flash_pca(X, K=32)            # exact eigh on Gram / cov matrix
flashlib.flash_pca(X, K=32, tol=1e-4)  # Halko subspace: ~30x faster

04 / 04Cost-Predictable API

Agent-Native API: Transparent Source and Predictable Cost for Users and Agents

In an era when LLM-based agents increasingly read, call, and modify performance code, the cost of a library is not just its kernel throughput but how legible its cost model and source are. FlashLib is written in Triton and CuteDSL with no opaque binaries — every kernel fromflash\_kmeans\(\.\.\.\)down to thetl\.dotcall is editable. And every primitive ships a GPU-free cost-prediction surface:flashlib\.info\.estimate\(\.\.\.\)takes a shape and a tolerance and returns a recursive tree of runtime, FLOPs, HBM bytes, and bound regime in ~5 microseconds on pure CPU, never importing torch, triton, or cutlass. An LLM agent can compose a pipeline of ten primitives, walk that cost tree, and decide whether the budget fits before spending a single FLOP.

import flashlib.info as info   # pure stdlib -- no torch/triton/cutlass.

# Predict cost without touching the GPU -- ~5 microseconds on pure CPU.
est = info.estimate("pca", shape=(1_000_000, 512), params={"K": 32}, device="H200")

print(est.summary_line())
# pca   13.18 ms  bound=compute   42 TF  (83% peak)  res~1e-7  [roofline]

est.print_tree()              # walk the recursive call-stack tree
# pca           13.18 ms  2.18 GB  compute   42 TF  res~1e-7
# ├── cov_gemm  10.49 ms  2.05 GB  compute   50 TF
# ├── eigh       0.12 ms  0.00 GB  compute    3 TF
# └── transform  2.57 ms  2.18 GB  compute   13 TF

# Pareto-optimal GEMM variants for a 4Kx4Kx4K matmul on H200:
for v in info.pareto("gemm", shape=(4096, 4096, 4096), device="H200"):
    print(v)
# Variant('gemm_fp16'           : 0.2 ms  residual~8e-04)
# Variant('gemm_tf32'           : 0.4 ms  residual~3e-04)
# Variant('gemm_3xfp16'         : 0.5 ms  residual~2e-04)
# Variant('gemm_fp16_x3_kahan'  : 0.6 ms  residual~5e-07)
# Variant('gemm_ozaki2_cute'    : 0.8 ms  residual~2e-15)

Benchmarks

All results below are measured on a single NVIDIA H200 (SM90, 150 GB HBM3e) withCUDA 13\.0, driver580\.126,PyTorch 2\.11,Triton 3\.6, againstcuML 25\.10. Every cell is the median over 5 iterations with the first call discarded for JIT amortization; inputs are GPU-resident on both sides; matched-algorithm rows (samealgorithm,method,svd\_solver) are paired with reduced-precision and algorithmic-shortcut rows so the comparison is fair at every shape.

1. Breadth: speedup over cuML across 13 primitives

The first benchmark is a broad sweep: 13 primitives × 194 (shape, dtype, hyperparameter) cells, all run againstcuml 25\.10on the same H200.Every cell here is an apples-to-apples comparison: matched algorithm, matched precision, matched hyperparameters — FlashLib is forbidden from using any reduced-precision GEMM (no bf16/fp16/Ozaki) or algorithmic shortcut (no Halko, no FFT t-SNE, no NN-Descent KNN).Because of this, the bars below are strictlylowerthan the headline numbers at the top of the post: the hero stats are FlashLib’s best ceiling speedups on each primitive (which, where applicable, do let the user trade precision or algorithmic exactness for throughput via thetolknob from Principle 03), whereas the broad sweep deliberately removes those degrees of freedom to isolate the pure kernel-engineering win. The bar chart below collapses each primitive’s 8–34 cells into a single bar —Geomeanshows the geometric-mean speedup across all of that primitive’s cells, andMaxshows the per-primitive ceiling on the most favourable cell. FlashLib is at least as fast as cuML on193 / 194cells;126cells cross 5×, and11cross 50×.

SpotlightClick one of the five primitive buttons to pin it to the top, then switch between geomean and max to watch the ordering and bar lengths move smoothly.

2. Depth: precision × runtime trade-off inside one primitive (GEMM)

The second benchmark zooms into a single primitive, GEMM at 4096³ on H200, to show what tolerance routing looks like in practice. FlashLib ships ~10 GEMM variants inflashlib\.linalg\.gemm— bf16, fp16, tf32, fused multi-pass (3xbf16,fp16\_x3\_kahan), and the Ozaki-II INT8 family (ozaki2\_cute,ozaki2\_int8) — all behind the singletol-routed dispatcher from Principle 03. The scatter below plots each variant on RMS relative error (vs an FP64 reference) against per-call runtime. The dashed curve is thePareto frontier: variants below it dominate the rest. The interesting point isozaki2\_cute\(s=8\): it sits belowandto the left offp32, meaning it is simultaneously tighter and ~2× faster than the native fp32 path on this shape.

10⁻²10⁻³10⁻⁴10⁻⁵10⁻⁶10⁻⁷0.20.51.02.03.0Runtime per GEMM at 4096³ (ms, log)RMS relative error vs FP64 reference (log)dominated region ↑bf16fp16tf323xbf16fp16_x3_kahanozaki2_cute (s=8)tighter AND faster than fp32fp32native · dominated~2× faster, ~3× tighter

3. Agent loops: Does FlashLib accelerate the agent loop?

Our third benchmark returns to the core motivation of FlashLib: many agentic workflows are bottlenecked not only by LLM rollout, but also by auxiliary operators such as retrieval, clustering, vector search, and numerical routines. We therefore evaluate FlashLib in a meta setting: whether giving a coding agent access to FlashLib helps it ship a faster end-to-end system.

Concretely, we task Claude Code Opus 4.7 with building a GPU vector-search backend, based on the GPU port ofKCORES/vector-db-bench, under a 1M-token budget. The goal is to maximize QPS subject to a strict recall constraint of at least 0.999 on SIFT-1M. We evaluate the resulting system across three workloads: offline batch search, online single-query search, and streaming search. We run the same task twice, changing only whether the agent has access to FlashLib.

Offline batch (10k queries / call)

Bulk-vector queries with no latency constraint. The defining benchmark for fused matmul+top-k.

w/ flashlibdefault

050k100k150k200k250k300kQPS at recall ≥ 0.9990s1000s2000s3000s4000sElapsed time (Claude Opus 4.7, 1M-token budget)37k QPS · Switched to cuVS brute_forceRAFT-fused distance + top-k48k QPS · Enabled TF32 matmul~4× faster cuBLAS path50k QPS · Budget exhausted at 1.07M tokensstuck at 50k QPS — no further ideas162k QPS · Adopted flashlib.flash_knnTriton fused distance + top-k310k QPS · Tiled batching, plateau at 310k QPSall 10k queries in one launch · exited under budget Online single (one query / call)

Per-query launches, so kernel launch overhead dominates. Both variants saturate around the same ceiling.

w/ flashlibdefault

00.5k1k1.5k2k2.5k3kQPS at recall ≥ 0.9990s1000s2000s3000s4000sElapsed time (Claude Opus 4.7, 1M-token budget)0.5k QPS · Naive cuVS path for bsz=1RAFT fused kernel wastes launch on 1 query2.6k QPS · Dispatched torch path for bsz≤645× jump after eliminating cuVS overhead1.9k QPS · Per-query flash_knnlaunch overhead dominates this regime2.6k QPS · Trimmed per-call Python overheadtorch fallback for bsz=1 not needed Streaming (small slabs, hard 0.999 recall)

Tie-breaks against the bench’s cupy ground truth dominate. Hybrid fast/slow paths unlock the high-QPS region.

w/ flashlibdefault

01k2k3k4k5k6kQPS at recall ≥ 0.9990s1000s2000s3000s4000sElapsed time (Claude Opus 4.7, 1M-token budget)0.9k QPS · cupy argpartition pathmatches bench GT byte-for-byte1.0k QPS · Slow incremental tuningnever tried a hybrid fast/slow split1.0k QPS · First config that holds 0.999 recallflash_knn alone tie-breaks ≠ cupy GT4.3k QPS · Hybrid pathflash_knn fast + cupy tie-break for ties5.0k QPS · Reduced cupy fallback rate≥95% of queries skip the slow path Offline batch

310kw/ flashlib

50kdefault

6.2×

Online single

2.6kw/ flashlib

2.6kdefault

≈ 1×

Streaming

5.0kw/ flashlib

1.0kdefault

5.2×

On the two settings where flashlib unlocks a structural win (offline batch and streaming), the flashlib-prompted agent reaches5–6× higher QPSby directly adoptingflash\_knninstead of laboriously rediscovering the cuVS path and slowly bolting on TF32. On online single, where per-query launch overhead dominates either way, both variants converge to roughly the same ceiling. Just as importantly, the flashlib-prompted agent finishes the tasknaturallyat 904k tokens with budget to spare; the default agent runs out at 1.07M tokens with no further ideas on hand — it hit a 50k QPS plateau on the dominant offline-batch task and could not escape it within the budget.

Example: Flash-KMeans in a few lines

FlashLib ships on PyPI and installs with a single command. The smart dispatcher inflashlib\.primitives\.kmeanstakes a GPU tensor of shape\(N, D\)or\(B, N, D\)and returns the cluster IDs and centroids; it picks between the Triton path, the Hopper-only CuteDSL FA3-style fused-assign path, and a pure-torch fallback automatically based on the shape.

$ pip install flashlib
import torch
from flashlib.primitives.kmeans import flash_kmeans

# 1M points, 128 dims on a single H200 -- runs in a few ms.
x = torch.randn(1_000_000, 128, device="cuda", dtype=torch.float32)
labels, centroids = flash_kmeans(x, n_clusters=1024, max_iters=20)
# labels:    (1, 1_000_000) int64   -- cluster assignment per point
# centroids: (1, 1024, 128) float32 -- final cluster centers

If you want to know what the call will costbeforelaunching it — useful for an agent budgeting a pipeline of ten primitives — the same shape goes through the pure-stdlib cost API from Principle 04:

import flashlib.info as info

est = info.estimate("kmeans",
                    shape=(1_000_000, 128),
                    params={"K": 1024, "max_iters": 20})
print(est.summary_line())
# kmeans  ~7 ms  bound=compute  ~140 TF  (64% peak)  [calibrated]

Other primitives follow the same shape:flash\_knn,flash\_pca,flash\_hdbscan,flash\_tsne,flashlib\.gemm, and the rest of the catalog are all one-call drop-ins with the sametol/backendknobs and the sameflashlib\.info\.estimate\(\.\.\.\)cost surface.

Limitations and Future Work

FlashLib’s current release covers an important but incomplete slice of the classical-ML operator landscape — primarily the clustering, nearest-neighbour, dimensionality-reduction, and dense linear-algebra primitives most central to our own agentic-AI workloads. Extending the catalog is one of our top priorities; upcoming releases will deepen coverage of the existing families and add new ones (Gaussian mixtures, kernel methods, graph-based primitives, sparse inputs). A second limitation is hardware coverage: development and benchmarking have so far concentrated on H200. The dispatcher and kernels are written to be portable, but broader measurement and tuning across a wider range of hardware is still needed.

Acknowledgements

FlashLib was deeply inspired by several open-source efforts that pioneered the “flash” design playbook on which this work is built:FlashAttention,FlashDecoding, NVIDIA’scuVSandcuML,FlashLinearAttention, andFlashInfer. We are also grateful toDacheng Li(UC Berkeley),Shiyi Cao(UC Berkeley),Melissa Pan(UC Berkeley), andZihao Ye(University of Washington) for many helpful discussions, and to theflash-kmeansandSparse VideoGencommunities for valuable feedback on early prototypes.

Citation

If FlashLib is useful in your research or product, please cite the project as:

@misc{yang2026flashlib,
  title  = {FlashLib: Bringing Flash Magic to Classical Machine Learning Operators},
  author = {Yang, Shuo and Xi, Haocheng and Zhao, Yilong and Mang, Qiuyang and
            Wang, Zhe and Sun, Shanlin and Keutzer, Kurt and Gonzalez, Joseph E. and
            Han, Song and Xu, Chenfeng and Stoica, Ion},
  year   = {2026},
  url    = {https://flashml-org.github.io/},
}

Similar Articles

Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Hugging Face Daily Papers

Flash-GMM introduces a fused Triton kernel for Gaussian Mixture Models that achieves 20x speedup and enables training on datasets 100x larger on a single GPU, making soft clustering a viable drop-in replacement for k-means in approximate nearest neighbor search.