@mixedbreadai: https://x.com/mixedbreadai/status/2071678747439505816
Summary
Mixedbread AI introduces asymmetric quantization for late interaction retrieval, achieving 32x storage reduction with minimal quality loss by storing document vectors as binary signs while keeping query vectors high-precision, making late interaction practical for billion-scale production systems.
Cached at: 06/30/26, 03:48 PM
Asymmetric Quantization: Near-Lossless Late Interaction Retrieval with 97% Storage Reduction
TL;DR: We built a multimodal and multilingual search engine using late interaction. To serve it, we built our own object-storage-based vector database. Here we talk about asymmetric quantization, a method that lets us cut storage by 32x while losing almost no performance.
Late interaction models like Wholembed v3 make retrieval much more precise, because they preserve fine-grained information instead of compressing a whole document into one vector. But they also change the storage economics. A single document produces more than one embedding; depending on its complexity, it can produce hundreds or thousands of vectors. Each vector has to be stored and later used for retrieval.
Mixedbread Search runs on silo, our retrieval engine for multimodal late interaction at billion-document scale. Silo stores vectors for more than 2.5 billion documents in object storage and hydrates them into faster tiers as queries need them. At that scale, every extra byte per document is repeated billions of times, and it shows up directly in cost per stored document, shard cold-start time, and the bytes each query has to read. We need to resolve that tradeoff: keep the whole system cheap while maintaining the quality that makes late interaction worth running.
This post walks through asymmetric quantization, one of the optimizations that makes running late interaction practical in production. We keep the query vectors at higher precision and store the document vectors as binary signs. In our internal benchmark suite, that cuts raw document-vector storage by 32x on average, from 393 KiB to 12.28 KiB per document, while holding retrieval quality at 89.65 NDCG@10 versus 90.26 for fp32.
Quantization: Making Multi-Vector Storage Practical
Quantization means representing high-precision floating point vectors with lower-precision values such as int8, or even 1-bit signs. The goal is to preserve ranking quality while reducing payload size. This matters especially for silo. Object storage gives us durable, low-cost persistence. To make it suitable for real workloads, we need indexes compact enough to serve quickly. And on the document side, payload size is what dominates the cost.
Naive late interaction is expensive because it stores more vectors. A standard single-vector embedding with 3072 dimensions in fp32 takes 12 KiB per document. A multi-vector representation with 786 vectors of 128 dimensions carries much more information, but uncompressed it is about 33x larger.
Storage numbers here refer to raw vector payloads only. Production indexes also include document IDs, metadata, and layout overhead.
With binary document vectors, a 786-token multi-vector document is only about 2% larger than a 3072-dimensional fp32 single vector. That means you can pay roughly single-vector storage and get late interaction quality. This changes the tradeoff: late interaction becomes practical to run by default, instead of something reserved for cases that justify the storage.
This is not a new direction for late interaction. ColBERTv2 showed that aggressive compression can reduce the footprint of late interaction models while preserving quality. PLAID showed that late interaction retrieval can be engineered down to practical latency using optimized retrieval and pruning. For production systems, both lessons matter: the model has to be precise, and the representation has to be cheap enough to move through hardware.
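The storage figures quoted in this section can be checked with a few lines of arithmetic. The sketch below counts raw vector payload only, and the helper name `payload_bytes` is ours for illustration, not something from silo.

```python
# Raw payload sizes only (no document IDs, metadata, or layout overhead).
KIB = 1024


def payload_bytes(num_vectors: int, dims: int, bits_per_dim: int) -> int:
    """Bytes needed to store num_vectors vectors of dims dimensions each."""
    return num_vectors * dims * bits_per_dim // 8


single_fp32 = payload_bytes(1, 3072, 32)    # one 3072-d fp32 embedding
multi_fp32 = payload_bytes(786, 128, 32)    # 786 x 128 multi-vector, fp32
multi_int8 = payload_bytes(786, 128, 8)     # same shape, int8
multi_binary = payload_bytes(786, 128, 1)   # same shape, 1 sign bit per dim

for name, size in [
    ("single-vector fp32", single_fp32),    # 12.00 KiB
    ("multi-vector fp32", multi_fp32),      # 393.00 KiB
    ("multi-vector int8", multi_int8),      # 98.25 KiB
    ("multi-vector binary", multi_binary),  # 12.28 KiB
]:
    print(f"{name:>20}: {size / KIB:7.2f} KiB")

print(f"fp32 multi vs fp32 single: {multi_fp32 / single_fp32:.2f}x")     # 32.75x
print(f"fp32 multi vs binary multi: {multi_fp32 / multi_binary:.0f}x")   # 32x
print(f"binary multi vs fp32 single: {multi_binary / single_fp32:.3f}x") # 1.023x
```

The 32x figure is simply the ratio of 32 bits to 1 bit per dimension, so it holds for any document length. The "about 2% larger" comparison is specific to the 786-token, 128-dimension example.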
Why Asymmetric Quantization
Compressing the document vectors saves storage, IO, cache space, and cold-start time across the entire corpus. Compressing the query vectors saves almost nothing because the query is small, short-lived, and never stored in the index.
This is also why we do not binarize both sides. Fully binary retrieval is the most compact option, but dropping the query to single bits throws away the magnitude information the ranking depends on, and it costs far more quality than binarizing documents alone (as shown later).
So we keep the query in int8 and store only the document vectors as binary signs. The query stays precise enough to preserve ranking, while the document side gets the storage reduction that matters for serving.
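As a rough illustration of the two sides, the sketch below quantizes a query vector to int8 and reduces a document vector to packed sign bits. The post does not describe silo's actual quantizer, so the symmetric max-abs scaling, the treatment of zero as positive, and the LSB-first bit order are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_query_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization (assumed scheme): the largest magnitude maps to 127.

    Returns the int8 values and the scale, so a caller can rescale scores if needed.
    """
    max_abs = max(abs(x) for x in vec) or 1.0  # avoid dividing by zero for an all-zero vector
    scale = 127.0 / max_abs
    values = [max(-127, min(127, round(x * scale))) for x in vec]
    return values, scale


def binarize_document(vec: List[float]) -> bytes:
    """Keep only the sign of each dimension, packed 8 dimensions per byte.

    Assumed convention: bit = 1 means +1 (value >= 0), bit = 0 means -1.
    Dimension i lives in byte i // 8 at bit position i % 8 (LSB first).
    """
    if len(vec) % 8 != 0:
        raise ValueError("dimension count must be a multiple of 8")
    packed = bytearray(len(vec) // 8)
    for i, x in enumerate(vec):
        if x >= 0.0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


doc_token = [0.3, -0.1, 0.8, -0.4, -0.2, -0.9, -0.5, 0.6] * 16  # 128 dims
print(len(binarize_document(doc_token)))  # 16 bytes instead of 512 bytes in fp32
```

Only the document path needs to be compact. The query is quantized once per request and then discarded.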
The Scoring Trick
Binary document vectors are smaller and thus cheaper to store.
For int8 × int8 scoring, modern ARM CPUs give us direct support through NEON dot-product instructions. Our AArch64 kernel uses SDOT to accumulate sixteen int8 multiplications into int32 lanes, then horizontally reduces the result with vaddvq_s32. For int8 × binary scoring, the useful identity is simpler. If each document dimension is stored as a sign bit, with b_i in {-1, +1}, then:
\[ \sum_i q_i b_i = 2 \sum_{i \,:\, b_i = +1} q_i - \sum_i q_i \]
So, scoring does not require a full multiply for every dimension. We need the sum of query values selected by the positive document bits, and the total query sum.
In the kernel, document signs are packed into bits. For the 128-dimensional path, we precompute query sums and eight query bit-planes. Each document token is loaded as 16 packed bytes, shifted and masked into eight 0/1 masks with NEON integer operations, then scored with SDOT against the query planes. The final score uses the identity above: 2 * selected_query_sum - query_sum.
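A scalar reference for that final formula, 2 * selected_query_sum - query_sum, may help make it concrete. This is not the NEON kernel, only a plain Python sketch. The bit layout (LSB first, bit 1 meaning +1) is an assumption for the example, as is the ColBERT-style MaxSim aggregation across tokens, which the post does not spell out. Quantization scales are ignored.

```python
from typing import List, Sequence


def score_int8_binary(query: Sequence[int], packed_doc: bytes, query_sum: int) -> int:
    """Dot product of an int8 query with a +/-1 document vector, without multiplies.

    Uses: sum(q_i * b_i) = 2 * sum(q_i where b_i = +1) - sum(q_i).
    """
    selected = 0
    for i, q in enumerate(query):
        if (packed_doc[i // 8] >> (i % 8)) & 1:  # bit set means b_i = +1
            selected += q
    return 2 * selected - query_sum


def naive_dot(query: Sequence[int], packed_doc: bytes) -> int:
    """Reference: expand the bits to +/-1 and multiply dimension by dimension."""
    signs = [1 if (packed_doc[i // 8] >> (i % 8)) & 1 else -1 for i in range(len(query))]
    return sum(q * b for q, b in zip(query, signs))


def maxsim(query_tokens: List[Sequence[int]], doc_tokens: List[bytes]) -> int:
    """Assumed late-interaction aggregation: best document token per query token, summed."""
    total = 0
    for q in query_tokens:
        q_sum = sum(q)  # computed once per query token, reused for every document token
        total += max(score_int8_binary(q, d, q_sum) for d in doc_tokens)
    return total


query = [3, -5, 7, 1, -2, 4, 0, -6]
doc = bytes([0b10000101])  # dimensions 0, 2 and 7 are +1, the rest are -1
print(score_int8_binary(query, doc, sum(query)), naive_dot(query, doc))  # 6 6
```

The query sum depends only on the query token, so it is paid once per query rather than once per document token.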
Binary × binary is even cheaper computationally since it can use Hamming distance, but the quality loss is too large for our main retrieval path.
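For comparison, here is a minimal sketch of the fully binary case. When both sides are ±1 vectors packed as bits, the dot product equals the dimension count minus twice the Hamming distance, so scoring reduces to an XOR and a popcount. The function name is illustrative, not part of silo.

```python
def score_binary_binary(packed_query: bytes, packed_doc: bytes) -> int:
    """Dot product of two +/-1 vectors stored as packed sign bits.

    Matching bits contribute +1 and differing bits contribute -1,
    so the result is dims - 2 * hamming_distance.
    """
    if len(packed_query) != len(packed_doc):
        raise ValueError("vectors must have the same packed length")
    dims = 8 * len(packed_query)
    diff = int.from_bytes(packed_query, "little") ^ int.from_bytes(packed_doc, "little")
    hamming = bin(diff).count("1")  # popcount of the XOR
    return dims - 2 * hamming


print(score_binary_binary(bytes([0b10000101]), bytes([0b00000111])))  # 4
```

Every dimension contributes exactly +1 or -1 here, so the query no longer carries any information about how strongly each dimension matters.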
Retrieval Quality
We evaluated several precision pairings across our internal retrieval benchmark suite. Scores are NDCG@10 averaged across the suite, scaled to 0–100. NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) measures how well the top 10 results are ordered against the ideal ranking, rewarding relevant documents more when they appear higher, with 100 being a perfect ranking. The full-precision baseline averages 90.26. Int8 query against binary documents averages 89.65, a 0.61 point drop, while reducing document-vector storage by 32x. Part of the reason the drop is so small is that Wholembed v3 is trained with silo’s tradeoff in mind, so it is robust to the quantization.
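For readers unfamiliar with the metric, a minimal NDCG@k sketch follows. It assumes linear gain with a log2 rank discount, which is one common variant. The post does not say which variant the internal suite uses, and the function and parameter names are ours.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: relevance at 0-based rank r is divided by log2(r + 2)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg(ranked_relevances: Sequence[float], all_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k on a 0-100 scale.

    ranked_relevances: judged relevance of the returned results, in returned order.
    all_relevances: judged relevance of every candidate for the query (defines the ideal ranking).
    """
    ideal = dcg(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0
    return 100.0 * (dcg(ranked_relevances, k) / ideal)


judged = [3, 2, 1, 0]
print(ndcg([3, 2, 1, 0], judged))            # 100.0, the ideal order
print(round(ndcg([1, 2, 3, 0], judged), 2))  # lower, the best document is ranked third
```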
For runtime, we benchmarked the scoring kernel on an ARM machine with a 33 × 128 query over a list of 1000 documents, each 786 × 128. The table reports median latency across 9 measured runs after 2 warmups, plus speedup relative to the fp32 baseline.
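The measurement protocol described here (2 warmups, then the median of 9 timed runs) can be sketched as a small harness. This is an illustrative Python version with a stand-in workload, not the benchmark code used for the post.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], warmups: int = 2, runs: int = 9) -> float:
    """Run fn a few times untimed, then return the median wall-clock latency in milliseconds."""
    for _ in range(warmups):
        fn()  # warm caches and lazy initialisation; not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def stand_in_workload() -> int:
    # Hypothetical placeholder for "score one query against 1000 documents".
    return sum(i * i for i in range(10_000))


latency = median_latency_ms(stand_in_workload)
print(f"median latency: {latency:.3f} ms")
```

The median is less sensitive than the mean to a single slow run, which matters with only nine samples.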
There are two useful operating points.
If we want maximum quality with lower bandwidth and faster integer scoring, int8 × int8 is essentially lossless in this setup. It is slightly ahead of the fp32 baseline within measurement noise, while cutting document storage by 4x and running 3.2x faster in this benchmark.
If we want the best storage economics, int8 × binary is the more interesting point. It keeps most of the ranking quality while shrinking document vectors by 32x and running 3.8x faster than fp32. For an object-storage-backed system, that is a direct cut in corpus-side bytes.
Binary × binary looks appealing on paper. It uses the same 12.28 KiB of document storage as int8 × binary, and at 4.3x it is the fastest option here. But it drops 7.20 points against the baseline, more than ten times the int8 × binary drop, despite reading exactly the same document bytes. The only thing that changed is the query. In practice, it removes too much query signal.
What This Unlocks
Asymmetric quantization works because retrieval systems do not pay for query and document precision in the same way. The document vectors dominate the long-term cost of the system: they are stored, replicated, cached, evicted, rehydrated, and scored repeatedly. The query vectors are short-lived, so spending a few more bits on the query and saving bits on every stored document is the right tradeoff.
For silo, this makes late interaction retrieval much easier to serve at large scale: a lower cost per stored document, faster shard cold-start, and a hardware path that spends more time scoring and less time moving bytes around, which allows higher QPS. This lets us get the quality of multi-vector representations without treating every document as a large fp32 object.
If this is the kind of systems problem you want to work on, we are hiring.