@Andy_ShuoYang: FlashLib update: we now support ANN search with IVF-Flat — up to 6.5× faster than cuVS on real-world vector workloads (…
Summary
FlashLib adds ANN search with IVF-Flat, running up to 6.5× faster than cuVS on real-world vector workloads (SIFT-1M) while matching recall. LEANN now supports FlashLib as a backend, with 26× faster build, 29× faster single-query search, and 298× faster batch search.
Cached at: 06/05/26, 05:11 AM
FlashLib update: we now support ANN search with IVF-Flat — up to 6.5× faster than cuVS on real-world vector workloads (SIFT-1M) while matching recall.
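For readers unfamiliar with the technique, here is a minimal CPU sketch of what IVF-Flat search does in general. It is not FlashLib's or cuVS's API or implementation: every function name and parameter below (`build_ivf_flat`, `ivf_flat_search`, `n_lists`, `n_probe`, and so on) is hypothetical and chosen for illustration. Vectors are partitioned into inverted lists by a small k-means, a query scans only the `n_probe` lists whose centroids are closest, and recall is measured against an exact brute-force search.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    return (a * a).sum(1)[:, None] - 2.0 * a @ b.T + (b * b).sum(1)[None, :]


def build_ivf_flat(data, n_lists, n_iters=10, seed=0):
    """Cluster data with plain Lloyd k-means and build one inverted list per centroid."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), n_lists, replace=False)].copy()
    for _ in range(n_iters):
        assign = np.argmin(sq_dists(data, centroids), axis=1)
        for c in range(n_lists):
            members = data[assign == c]
            if len(members):  # leave a centroid unchanged if its cluster is empty
                centroids[c] = members.mean(axis=0)
    assign = np.argmin(sq_dists(data, centroids), axis=1)
    # "Flat" means each list stores ids of raw, uncompressed vectors.
    lists = [np.flatnonzero(assign == c) for c in range(n_lists)]
    return centroids, lists


def ivf_flat_search(query, data, centroids, lists, k, n_probe):
    """Scan only the n_probe lists nearest to the query and return up to k ids."""
    centroid_d = sq_dists(query[None, :], centroids)[0]
    probe = np.argsort(centroid_d)[:n_probe]
    candidates = np.concatenate([lists[c] for c in probe])
    d = sq_dists(query[None, :], data[candidates])[0]
    return candidates[np.argsort(d)[:k]]


def brute_force(query, data, k):
    """Exact k nearest neighbours, used as ground truth."""
    return np.argsort(sq_dists(query[None, :], data)[0])[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbours that the approximate search returned."""
    return len(set(approx_ids.tolist()) & set(exact_ids.tolist())) / len(exact_ids)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.standard_normal((2000, 32))  # synthetic data, not SIFT-1M
    query = rng.standard_normal(32)
    centroids, lists = build_ivf_flat(data, n_lists=16)
    exact = brute_force(query, data, k=10)
    for n_probe in (1, 4, 16):
        approx = ivf_flat_search(query, data, centroids, lists, k=10, n_probe=n_probe)
        print(n_probe, recall_at_k(approx, exact))
```

Raising `n_probe` trades speed for recall: probing every list degenerates to exact search, while probing few lists scans a small fraction of the data. This is why a speed comparison between ANN libraries is only meaningful at matched recall, as the post states.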
LEANN now supports FlashLib as a backend: 26× faster build, 29× faster single-query, and 298× faster batch search. Huge thanks to @YichuanM for the help!
We’re also opening Discord / Slack channels — join us to suggest new operators you want to see, and hardware backends you want FlashLib to support next!
Slack: https://join.slack.com/t/flashml/shared_invite/zt-3zpdh5j10-9dwTXrgLiqpVxizhA9KVbA…
Discord: https://discord.gg/ce5Xa5pf
Similar Articles
@Andy_ShuoYang: Flash-KMeans was only the beginning. Today, from the Flash-KMeans team, we are releasing FlashLib — a GPU library for f…
The Flash-KMeans team releases FlashLib, a GPU library for classical ML operators that achieves up to 208x speedups over cuML on Hopper GPUs, with a focus on fast, predictable performance for agentic AI workloads.
@neural_avb: Deep learning bros and sisters, don't sleep on this. You can cluster millions of documents in embedding space, mass-ann…
Shuo Yang and team release FlashLib, a GPU library that accelerates classical ML operators like KMeans, KNN, HDBSCAN, PCA, and t-SNE, claiming speedups up to 208x.
@pupposandro: 2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, …
A new toolset (DFlash + PFlash) achieves 2.5x faster inference than llama.cpp on AMD Ryzen AI MAX+ 395 iGPU, demonstrating significant speedups for Qwen3.6-27B with 128 GiB unified memory.
@davideciffa: Huge thanks to @csujun, now Luce DFlash is 10-15% faster, by implementing per-layer K/V truncation in the draft graph f…
Luce DFlash has achieved a 10-15% speedup by implementing per-layer K/V truncation in the draft graph for SWA layers.
@vintcessun: Compressing 10 million vectors from 31GB to 4GB, with search even faster than FAISS — sounds crazy, but Turbovec actually did it. The core is Google's TurboQuant data-independent quantization: no training, no parameter tuning, just add vectors and index. Handwritten NEON/AVX-512 implementations are genuinely 12-20% faster, supporting filtered search by ID, saving a ton of post-processing hassle. Rust under the hood + pip install, minimal maintenance cost.
Turbovec, based on Google's TurboQuant algorithm, compresses 10 million vectors from 31GB to 4GB, with search speed 12-20% faster than FAISS, supports filtered search, and offers a Rust implementation with a Python package.