Inside FAISS: Billion-Scale Similarity Search

Hacker News Top Tools

Summary

Educational article explaining FAISS, a library for billion-scale similarity search, covering vector embeddings, nearest neighbor search, and techniques like IVF and Product Quantization for efficient retrieval.

Author here. I wrote this as a visual companion to the 2017 FAISS paper (<a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1702.08734" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1702.08734</a>), focused on the parts I found hardest to grok from text alone.<p>The article covers a subset of what FAISS does, with the paper as the source of truth. NSG, FastScan, IMI are not covered here, they&#x27;ll get their own articles. I&#x27;d be especially interested in feedback on:<p>- the IVFPQ &#x2F; IVFADC explanation, particularly the LUT reuse argument<p>- whether the GPU part captures enough of the actual complexity<p>Happy to answer questions.
Original Article
View Cached Full Text

Cached at: 06/05/26, 08:10 PM

# Inside FAISS: Billion-Scale Similarity Search Source: [https://fremaconsulting.ch/blog/faiss](https://fremaconsulting.ch/blog/faiss) ## 1\. Everything is a Vector In modern AI, images, text, and audio are not understood by computers as "cat" or "symphony"\. Instead, they are converted into lists of numbers called**embeddings**or**vectors**\. These vectors live in a high\-dimensional geometric space\. The core idea is simple: items that are semantically similar are placed close together in this space\. Foundations · Vectors & similarity ### 1\. The Challenge Computers don't have eyes or ears\. They only understand numbers\. How do we explain the concept of a 'Dog' or a 'Cat' to a machine? ### 2\. Vectorization We use Neural Networks to transform raw data \(pixels, text\) into a list of numbers\. This list is the 'Embedding'\. It captures the semantic meaning\. ### 3\. The Vector Space These numbers act as coordinates \(X, Y, Z\.\.\.\)\. We can plot every image or word as a single point in a high\-dimensional space\. ### 4\. Similarity is Distance In this space, concepts that are similar are placed close together\. 'Cat' is close to 'Dog', but far from 'Car'\. We measure this using distance \(Euclidean or Cosine\)\. Above, we visualize a simplified 2D slice of this space\. Imagine this in 1024 dimensions\. Finding the "nearest neighbor" for a query point is equivalent to finding the most similar item in your database\. ## 2\. The NN Problem at Scale Formally, given a queryand a databaseofvectors, exact nearest\-neighbor search solves: The naive**Brute Force**solution evaluates every distance explicitly\. Cost per query:time andmemory to hold the database in full precision\. ForSIFT descriptors\(, 4 bytes per float\), that is 512 GB of RAM and one billion distance evaluations per search, a non\-starter for real\-time systems like web\-scale retrieval or live LLM memory\. The rest of this article explores the two escape routes FAISS exploits: - **Partitioning**: skip most ofat query time \(IVF\)\. - **Compression**: make each comparison cheap, and make the database fit in RAM \(Product Quantization\)\. Production systems combine both\. ## 3\. FAISS to the Rescue FAISS \(Facebook AI Similarity Search\) introduces**approximate**search methods\. By sacrificing a tiny bit of accuracy \(maybe you get the 2nd best match instead of the 1st\), we can speed up search by orders of magnitude\. Let's explore two key indexing strategies: - **Flat:**The baseline brute\-force\. Slow but accurate\. - **IVF \(Inverted File\):**Partitions the space into "Voronoi cells"\. We only look in the most promising cells\. Try it yourself\. Increase the dataset size and see how different indexes perform\. ### Interactive Index Benchmark Compare the performance of different indexing algorithms on random datasets\. Dataset Size \(N\) Index Type ## 4\. Partitioning with IVF The**Inverted File \(IVF\)**index works like a library classification system\. Instead of looking at every book to find a specific title, you go to the "Science Fiction" section first\. Mathematically, it uses K\-Means clustering to partition the vector space into**Voronoi cells**\. When we search, we first identify which cell our query vector falls into \(using a "coarse quantizer"\), and then we*only*calculate distances for vectors inside that cell \(and perhaps a few neighboring ones\)\. ### IVF Clustering1\. Raw Data K\-means partitions random vectors into the Voronoi cells that IVF searches\. **Start:**Here is our raw dataset\. 80 vectors in a 2D space\. Searching this linearly \(Brute Force\) would require calculating distance to 80 points\. ## 5\. Compressing with Product Quantization IVF makes search fast by skipping most of the database, but leaves every vector uncompressed\. One billion SIFT descriptors still cost 512 GB of RAM\.**Product Quantization**\(PQ\), introduced by[Jégou, Douze and Schmid \(2011\)](https://hal.inria.fr/inria-00514462/document), is the compression trick FAISS builds on to shrink each vector to 8 bytes while keeping distance estimates meaningful\. ### Same centroids, a different job [§4](https://fremaconsulting.ch/blog/faiss#sec-4)used centroids to*partition*the search space\. PQ uses them to*compress*each vector\. Same K\-Means math, new purpose: the centroid's*index*becomes the vector\. If your codebook hascentroids, any vector collapses to one integer in\. ### A pixel\-sized analogy A 24\-bit RGB pixel can be any of 16\.7 million colors\. A GIF gives up some of that range: it picks a 256\-color**palette**and replaces every pixel with an 8\-bit index into it\. Storage drops 3×, the picture still looks like the picture\. That palette is a codebook; the index is a code\. A 128\-D SIFT descriptor is just a very long pixel\. We want the same trick: find a palette of representative vectors, replace every descriptor with its nearest palette index\. ### How many bits does an index cost? To labelcentroids uniquely you need enough bits to distinguish all of them\. Each bit doubles the number of labels, so the code length isbits\. Concretely: - centroids →bits per code \(1 byte\) - centroids →bits - centroids → 64 bits \(8 bytes\) More centroids means the quantizer discriminates finer detail, and the code grows accordingly\. The codebook designer is playing a bit\-budget game\. ### The paper's aggressive target Jégou et al\. want each 128\-D SIFT descriptor to become a**64\-bit code**, which is**0\.5 bit per dimension**, a 64× compression ratio against the 512\-byte original\. Half a bit per dimension is ambitious: it means the quantizer has to encode all the variation along each axis with a single binary\-ish choice\. Hitting 64 bits with one flat codebook means pickingcentroids\. That is roughly 18 quintillion, more than the number of grains of sand on Earth\. Let's feel that number before we dismiss it\. ### Flat codebook feasibility Pick a vector dimension and a bit budget per dimension to see the centroid count and storage footprint a single flat quantizer would need\. Total bits per code64bits Centroids needed \(k = 2^64\)1\.84 × 10^19centroids Codebook storage9\.44ZB Impossible Bigger than all cloud storage on Earth\. Lloyd's algorithm would need roughly 5\.53 × 10^20training samples\. ### Why 264is not a codebook Three walls hit simultaneously whengets that large: - **Storage\.**Each centroid is 128 floats × 4 bytes = 512 bytes\. Timescentroids gives about**9\.4 zettabytes**, more than all cloud storage on Earth in 2025\. You cannot persist the codebook, let alone page through it at query time\. - **Training data\.**Lloyd's algorithm needs at least tens of samples per centroid to converge\. That istraining vectors\. No dataset in existence is that large\. - **Query cost\.**To encode one descriptor you compare it to every centroid\. 18 quintillion distance computations per vector is not a latency you can ship\. The flat codebook dies on all three fronts\. The paper's move is to keep the*effective*vocabulary the same size while shrinking the stored codebook by many orders of magnitude\. ### The factoring trick Back to the GIF analogy\. Instead of finding one 256\-color palette that works for the whole image, cut the image into 8 tiles and give each tile its own 256\-color palette\. Each tile is now one byte; the image is 8 bytes; every tile had access to a full 256\-color palette tuned to its own pixels\. PQ does exactly this for vectors\. Formally: splitintosub\-vectorsand learn one small sub\-quantizerper block\. The product quantizer is the tuple of their outputs: Withandcentroids per block, the*effective*codebook haspossible combinations \(same order of magnitude as the impossible flat case\), but we only storecentroid vectors ineach\. Total codebook storage is\(paper[§II\.B](https://hal.inria.fr/inria-00514462/document), Table I\), small enough to live in L2 cache\. Each encoded vector fits in, one byte per block index\. Ratio: 512 B to 8 B, a**64×**drop in memory, with a codebook that actually fits on the machine\. The walkthrough below builds the full index on the paper's reference SIFT setting \(\), turning each concept above into a concrete operation on real numbers\. Compression · Product Quantization ### 1\. One descriptor, zoomed all the way in A 128\-D vector is 128 floats\. Zoom into a single one and you find IEEE 754: 1 sign bit, 8 exponent bits, 23 mantissa bits\. Four bytes per dimension, 512 bytes per vector\. ### 2\. Cut it into 8 sub\-vectors Same 128 numbers, regrouped into 8 contiguous blocks of 16 dimensions each\. Each block gets its own colour because each one will get its own codebook\. ### 3\. Now multiply: a tiny demo database A codebook needs a population, not a single vector\. This demo uses 4 example vectors; the paper trains on millions\. Every database vector gets the same 8\-block split, and each block contributes one point per vector to its own subspace\. ### 4\. Focus on one codebook: u₁ Each subspace gets its own codebook, trained alone\. That independence is the product in Product Quantization\. To see the whole recipe, we zoom into just one subvector, u₁, rendered in 2\-D for readability \(the paper's subspaces live in d\* = D/m = 16 dimensions, with D = 128 and m = 8\)\. Four training vectors is a pedagogical cartoon; real PQ trains each subspace on roughly 10⁶ descriptors\. Lloyd's algorithm runs k\-means on this subspace alone: assign every point to its nearest centroid, move each centroid to its cluster mean, repeat until the within\-block distortion MSE\(q₁\) = 𝔼‖u₁\(x\) − q₁\(u₁\(x\)\)‖² stops falling\. The 6 centroids you end with \(k\* = 256 in the paper\) form the codebook C₁\. Encoding any future vector's first block is then a nearest\-centroid lookup: emit the index i of c₁,ᵢ\. ### 5\. Same thing, 8 times in parallel All 8 subspaces are quantized the same way at once, each block over its own slice of the data with its own codebook\. Lloyd's runs per block, independently \(k\* = 256 clusters per codebook in the paper, shown here with 6\)\. ### 6\. Encoding a brand\-new vector A query x arrives at search time, it was never in the training set\. Split it into 8 sub\-vectors and, in each plot, snap to the nearest centroid\. The 8 indices concatenate into the 8\-byte code\. ### 7\. Decoding and per\-block error Reconstruction is a table lookup: replace each index by its centroid\. The gap between the unknown raw point and its centroid is the per\-block quantization error; the total reconstruction error is their sum of squares\. ### 8\. Distance without decoding: SDC vs ADC Two ways to compute query\-to\-database distance from codes\. SDC quantises both points and reads centroid\-to\-centroid distances\. ADC keeps the query in full precision and reads raw\-to\-centroid distances\. ADC has a tighter error bound; FAISS defaults to it\. Once encoded, how do we measure distances without decoding every vector back to its 512 bytes? The paper gives two options\. 1. **SDC**\(symmetric\) quantizes both the query and the database vector and reads a pairwise centroid distance from a precomputedtable per sub\-quantizer\. 2. **ADC**\(asymmetric\) leaves the query in full precision and precomputes onlyquery\-to\-centroid distances per block, then sums one lookup per block: Per\-query cost is similar \(Jégou et al\., Table II\), but**ADC**has a tighter error bound: the mean squared distance error satisfiesfor**ADC**\(Eq 18\) versusfor**SDC**\. FAISS defaults to**ADC**\. ## 6\. Combining: IVFPQ **IVFPQ**\(also called IVFADC in the paper\) stacks the two previous techniques: a coarse quantizer prunes the database to a handful of cells, and a product quantizer compresses what remains to 8 bytes per vector\. One subtle choice makes the combination work: PQ encodes not the raw vector but its**residual**with respect to the coarse centroid\. **Indexing**a database vector: 1. coarse quantize, the nearest of thecoarse centroids; 2. compute the residual; 3. encode the residual with the product quantizer, giving 8 byte codes; 4. append the pairto the inverted list of cell, whereis the database index ofandis the m\-byte PQ code produced in step 3\. Residuals concentrate near zero, so the PQ codebook spends its bits modeling variation the coarse quantizer did not already capture\. Same byte budget, better reconstruction\. **Searching**for a query: 1. find thenearest coarse centroids to; 2. compute the query residualfor each selected cell; 3. precompute thedistance**L**ook\-**U**p**T**able \(**LUT**\)\. Indexselects the PQ sub\-block \(in our setup\), and indexselects one of thecentroids of that sub\-block's codebook \(\)\. Each entry is the squared distance between the query's sub\-block residual and that centroid: 4. scan the inverted list: for each stored entry with PQ codes\(one sub\-block index per\), read theprecomputed distances from the LUT and sum them:table reads,additions per candidate\. No float multiplications, no square roots\. 5. keep a fixed\-capacity max\-heap of sizeholding the best candidates seen so far, keyed by their ADC distance\. While the heap is not full, push every scanned candidate\. Once full, compare each new ADC distance against the root \(the current worst of the best\-\): if smaller, pop the root and push the candidate, otherwise discard it\. The heap is shared across theprobed inverted lists\. When the scan is done, drain the heap in ascending order to get theapproximate nearest neighbors of\. The walkthrough below runs the full pipeline with real arithmetic at every stage\. Phase 1 · Training \(offline\) ### 1\. Train the coarse quantizer Run k\-means on a training set to learncoarse centroids\. They carveinto Voronoi cells, and these are the*neighborhoods*of IVFADC\. Paper setting:for SIFT \(Jégou et al\.[§IV\.A](https://hal.inria.fr/inria-00514462/document)\)\. We render 8 cells here so each one stays readable\. ### 2\. From raw vectors to residuals For every training vector, subtract its nearest coarse centroid: Subtracting the centroid re\-centers each cell's points on the origin of its own frame\. Pooled together, all cells share one centered*residual*distribution, and that is what the product quantizer learns on\. ### 3\. Why residuals quantize better Residuals carry much less energy than raw vectors:on average\. Same centroid budget, smaller region to cover, finer quantization steps\. **Analogy: video compression\.**First send a coarse frame, then spend your bits on the difference\. You encode motion and detail, not the whole picture from scratch\. Toggle the**overlay**in the visual to superimpose both histograms and read the variance ratio\. ### 4\. Train PQ on residuals Split each residual intosub\-vectors\. Run Lloyd's on each block independently, with one codebookper block\. The codebooks model variation the coarse quantizer already subtracted away\. Same 8\-byte footprint, sharper reconstructions\. Training ends here; the next three chapters consume these codebooks at speed\. Phase 2 · Encoding the database ### 5\. Assign the coarse bucket A new database vectorarrives\. Compare it to every coarse centroid, pick the nearest\. Call its index\. This index identifies the inverted listwherewill be stored\. ### 6\. Encode the residual with PQ Compute, split into eight sub\-vectors, snap each to its nearest centroid in\. Emit the 8 block indices\. bits per block, 8 blocks, one byte each\. Eight bytes of code locateinside its inverted list, with no need to store the raw coordinates\. ### 7\. Append to the inverted list Appendto the inverted list\. The raw coordinates ofare*discarded*\. Memory per vector: Click a**Voronoi cell**on the left to inspect the entries stored in its list\. Phase 3 · Search ### 8\. Query arrives, probe w cells Your true nearest neighbour can land near a cell boundary\. To catch it, check theclosest coarse centroids to, not just the top one \(*multiple assignment*\)\. Adjust**w**on the left to probe more cells and see the pruning ratio update live\. ### 9\. Build the ADC lookup tables For each probed cell, compute a query\-specific residual, split it into 8 blocks, then for\(one row per sub\-block\) and\(one column per centroid\), fill 8 × 256 =**2,048**cells per LUT\. Built once per probed cell, reused across every candidate in its inverted list\. ##### Why isconstant across the whole list? The query residual relative to a specific coarse cellis - is the query, fixed for this search\. - is the center of the coarse bucket we are currently scanning, fixed for this specific inverted list\. - Thereforeis identical for every comparison we make within that list\. ##### Why is the LUT reused? For a candidateinside that list we estimate Using the residual decomposition, - is the query's offset from the cell center: same for everyin this list\. - \(with\) is the candidate's offset from the cell center: different for every, but it is just a*selection*of centroids from the codebooks\. For sub\-block, the candidate stores an index, so the per\-block contribution is **Key realization\.**The specific indexchanges from candidate to candidate, but the set of possible centroidsis the same for the whole database, andis the same for the whole list\. So we pre\-compute the distance from the query sub\-vector to*every*possible centroid and store it in the LUT\. When the scan moves to the next candidate, there is no math: we look up the index the candidate already points to \(say\) and read\. > “For each subquantizerthe distances between the partial residual vectorand all the centroidsofare preliminarily computed and stored\.” \(Jégou et al\., 2011\) ComponentStatus during list scanQueryFixedCell centerFixedQuery residualFixedSub\-codebook centroidsFixed \(permanent\)LUT valuesFixed \(computed once at start of list scan\)CandidateChanges \(iterating through the list\)Candidate indicesChanges \(used to look up the fixed LUT values\) ### 10\. Scan: 8 lookups, 7 adds Each stored codeis 8 table indices\. The ADC distance estimate is: No float multiplications, no square roots\. The scan loop is eight random reads from an L1\-resident table plus seven adds per candidate\. That is the inner loop of IVFADC\. ### 11\. Merge top\-K across w lists A fixed\-capacity max\-heap of sizetracks the current best\. Whenever a new estimate beats the heap root, the root is popped and the candidate is pushed\. After scanning alllists, drain the heap in ascending order; those are the approximate nearest neighbours\. ### 12\. Why it scales Scanning onlyentries instead of, with each candidate costing 8 look\-ups and 7 adds\. Training runs once; encoding and search run forever\. Dispatched on GPU with WarpSelect \(Johnson et al\. 2017,[§4\.2](https://arxiv.org/abs/1702.08734)\), the same pipeline handles SIFT1B at**17\.7 µs**per query on a single Titan X\. Assuming**balanced**inverted lists, each query scans only aboutentries instead of\. Jégou et al\. \([§IV\.A](https://hal.inria.fr/inria-00514462/document)\) recommendbetween 1,000 and 1,000,000 for SIFT; their Table V usesandwith\. Withat, that is roughly 7\.8 million scanned entries per query, each costing 8 LUT look\-ups and 7 additions\. On CPU, the 2011 paper reports 8\.8 ms per query on a 1\-million\-vector GIST benchmark with these parameters \(Table V\)\. Billion\-scale CPU latencies are an order of magnitude higher; the sub\-millisecond regime belongs to the GPU implementation of Johnson, Douze and Jégou \(2017\), covered in[§7](https://fremaconsulting.ch/blog/faiss#sec-7)\. ## 7\. Accelerating IVFPQ on GPU Going from CPU to GPU shifts the bottleneck\. The hard question is no longer “how fast can we sum these numbers” but “how fast can we stream thousands of tiny PQ codes from DRAM into the compute units\.” The answer is about**memory bandwidth**, not arithmetic\. Mental model Each inverted list is one aisle in a huge warehouse\. A query is a shopping list split intofeatures\. The LUT is a tiny price sheet, one row per feature\. A CPU sends*one*worker down the aisle; a GPU sends a**warp**: 32 threads moving in lock\-step\. Four tricks turn that crew into a billion\-vector scanner: 1. **Coalesced reads\.**Transpose the list so the warp picks up 32 sub\-codes in one 128\-byte DRAM burst instead of 32 scattered ones\. 2. **Shared\-memory LUT\.**Copy thetable into SRAM once per list\. Every thread then reads it at ~25 cycles instead of ~400 \([§5\.3](https://arxiv.org/abs/1702.08734)\)\. 3. **Warp scan\.**One thread per vector, 32 distances per tick, partial sums held in registers\. 4. **WarpSelect\.**Top\-K without a global lock: warp queue, then block merge, then global merge \([§4\.2](https://arxiv.org/abs/1702.08734)\)\. Step through the four tricks below; click any figure to zoom in on the labels\. ### Memory layout How the inverted list is stored decides how many DRAM transactions a warp costs\. Takeaway\.**Row\-major**puts all codes of one vector together\.**Transposed**groups all codes of one sub\-quantizer together\. With the transpose, 32 threads asking for sub\-q*j*land on contiguous bytes and the memory controller answers with a single 128\-byte transaction\. Scaling past one GPU is an explicit knob \([§5\.4](https://arxiv.org/abs/1702.08734)\):*replication*splits the query stream acrossdevices,*sharding*splits the database acrossdevices, and the two compose toGPUs\. ## 8\. Real\-World Use Cases FAISS isn't just for benchmarks\. Here are six production scenarios where vector search with FAISS powers real applications: #### Semantic Search Find documents by meaning, not just keywords\. Query embeddings retrieve the most relevant passages from millions of documents\. ``` index.search(embed("climate change effects"), k=10) ``` #### Image Similarity Reverse image search and visual deduplication\. Find visually similar products, detect copyright infringement, or organize photo libraries\. ``` index.search(clip_embed(query_image), k=5) ``` #### RAG / LLM Memory Retrieval\-Augmented Generation grounds LLM responses in real data\. FAISS serves as the fast retrieval backbone for context injection\. ``` context = index.search(embed(user_query), k=3) llm.generate(prompt + context) ``` #### Recommendation Systems Item\-to\-item and user\-to\-item recommendations\. Embed user preferences and product features, then find nearest neighbors\. ``` similar = index.search(item_embedding, k=20) ``` #### Document Deduplication Detect near\-duplicate documents at scale\. Hash or embed content, then use range search to find items within a similarity threshold\. ``` lims, D, I = index.range_search(emb, thresh=0.95) ``` #### Anomaly Detection Score outliers by distance to their nearest neighbors\. Points far from any cluster are likely anomalies or novel inputs\. ``` D, _ = index.search(point, k=5) anomaly_score = D.mean() ``` ## 9\. Conclusion Four ideas, layered on top of each other\.**Flat**gives perfect recall at impossible cost\.**IVF**partitions the space so most of it can be skipped at query time\.**PQ**compresses every vector into roughly 8 bytes so a billion of them fit in RAM\.**IVFPQ**combines the two, and on a single GPU the pipeline returns top\-K in microseconds\. The move underneath is always the same: embed your data into a geometry where*close*means*similar*, then pick the index whose trade\-offs match your constraints \(latency, memory, recall\)\. Graph indexes \(HNSW, NSG\) and hardware\-aware accelerators \(FastScan, IMI\) reach the same goal through different tricks and will get their own articles\. Once the geometry and the trade\-offs click, billion\-scale nearest\-neighbor search stops being a research problem and starts being a dial you turn\. ## References

Similar Articles

Comparing Vector search libraries

Reddit r/LocalLLaMA

Benchmarks vector search libraries (Faiss, Scann, Usearch) for speed, memory, and accuracy across dataset sizes from 500 to 1 million samples, with results and code available.

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Hugging Face Daily Papers

The paper introduces Direct Corpus Interaction (DCI), a novel approach allowing AI agents to query raw text directly using standard terminal tools instead of traditional embedding-based retrieval. By bypassing fixed similarity interfaces and offline indexing, DCI significantly outperforms conventional sparse, dense, and reranking baselines across multiple IR and agentic search benchmarks.

dmtrKovalenko/fff

GitHub Trending (daily)

fff is a fast, typo-resistant file search toolkit with frecency ranking and an MCP server for AI agents, providing efficient file and content search with git awareness.