@cosimorulli1: Happy to share that our recent work, TACHIOM, got integrated into the PyLate ecosystem! https://arxiv.org/pdf/2604.2814…

X AI KOLs Following 06/11/26, 02:28 PM Papers

Summary

TACHIOM, a multivector retrieval system with token-aware clustering and hierarchical indexing, has been integrated into the PyLate ecosystem. It achieves up to 247x faster clustering and 9.8x retrieval speedup over state-of-the-art systems while maintaining comparable effectiveness.

Happy to share that our recent work, TACHIOM, got integrated into the PyLate ecosystem! https://arxiv.org/pdf/2604.28142 (@SilvioMartinico, @fmnardini, @rventurini_ )

Original Article

View Cached Full Text

Cached at: 06/12/26, 02:51 AM

Happy to share that our recent work, TACHIOM, got integrated into the PyLate ecosystem! https://arxiv.org/pdf/2604.28142 (@SilvioMartinico, @fmnardini, @rventurini_ )

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Source: https://arxiv.org/html/2604.28142

Abstract.

Multivector retrieval models achieve state-of-the-art effectiveness through fine-grained token-level representations, but their deployment incurs substantial computational and memory costs. Current solutions—based on the well-knownκ\kappa-means clustering algorithm—group similar vectors together to enable both effective compression and efficient retrieval. However, standardκ\kappa-means scales poorly with the number of clusters and dataset size, and favours frequent tokens during training while underrepresenting rare, discriminative ones. In this work, we introduceTachiom, a multivector retrieval system that exploits token-level structure to significantly accelerate both clustering and retrieval. By accounting for tokens’ distribution during centroid allocation,Tachiomeasily scales to millions of centroids, enabling highly accurate document scoring using only centroids, avoiding expensive token-level computation.Tachiomcombines a graph-based index over centroids with an optimized Product Quantization layout for efficient final scoring. Experiments onMs Marco-v1andLoTTEshow thatTachiomachieves up to247×247\timesfaster clustering thanκ\kappa-means and up to9.8×9.8\timesretrieval speedup over state-of-the-art systems while maintaining comparable or superior effectiveness.

Multivector Retrieval, Late Interaction, Clustering, Efficiency.

1.Introduction

Transformer-based language models(Devlinet al.,2019; Vaswaniet al.,2017; Karpukhinet al.,2020; Formalet al.,2021; Lassanceet al.,2024)have fundamentally reshaped Information Retrieval (IR), shifting the paradigm from lexical-based to representation-based retrieval. Within this landscape, multivector models, such as ColBERT(Khattab and Zaharia,2020; Santhanamet al.,2022b), have emerged as a gold standard for effectiveness. By encoding documents as sets of token-level vectors and computing relevance via late interaction, i.e., allowing every query token to interact with every document token, these models capture fine-grained semantic nuances that single-vector dense retrievers often miss. However, the real-world deployment of multivector models remains severely constrained by their computational demands. Storing and searching large sets of token vectors creates massive memory footprints and computational overhead. To mitigate this, state-of-the-art solutions adopt agather-and-refinestrategy combined with token quantization(Engelset al.,2023; Santhanamet al.,2022a; Nardiniet al.,2024; Scheereret al.,2025; Bianet al.,2025). In the gather phase, a lightweight search over token vectors identifies promising candidates; in the refine phase, full MaxSim scores are computed only for the selected candidates. Both phases critically depend on a coarse quantizer—a set of centroid vectors that approximate the token embedding space. During gathering, centroids serve as efficient proxies for locating relevant tokens. For compression, each token vector is decomposed into a centroid assignment and a residual vector, which is then compressed using more precise Vector Quantization (VQ) techniques, like, for instance, Product Quantization(Jégouet al.,2011)(PQ). This two-level scheme substantially improves approximation: centroid assignments distinguish distant vectors at minimal cost, while residual compression captures local nuances.

ColBERTv2(Santhanamet al.,2022b)pioneered this centroid-based approach, compressing the residual (elementwise difference between the vector and the centroid) rather than the vector itself.Plaid(Santhanamet al.,2022a)extended this framework with centroid interaction and inverted-file filtering over clustered token embeddings for efficient retrieval. Subsequent work refined this framework:Emvb(Nardiniet al.,2024)introduced bitvector prefiltering to accelerate candidate selection and optimized PQ variants(Geet al.,2014; Fanget al.,2022)to improve redisuals’ compression, achieving significant speedups overPlaid.Warp(Scheereret al.,2025)combinedPlaid’s centroid interaction withXtr’s lightweight architecture(Leeet al.,2023); the authors showed that their method also generalize toColBERTv2embeddings.Igp(Bianet al.,2025)proposed building a graph index on the set of centroids to avoid exhaustive centroids scoring.

Across all these methods, both retrieval quality and compression effectiveness scale directly with the number of centroids: finer-grained centroids enable more precise approximations and more selective gathering. However, standardκ\kappa-means clustering—the predominant method for centroids selection—creates a severe bottleneck: it scales poorly with both dataset size and number of clusters, and critically, it ignores the token structure of multivector embeddings. Training even moderately large sets of centroids (e.g.,262262K) on multivector collections can require tens of hours on multi-core CPUs or expensive GPU resources, fundamentally limiting achievable granularity. In this paper, we address this bottleneck by exploiting a key structural property of multivector embeddings:token identity. Unlike generic vector sets, multivector collections exhibit strong correlations between token type and embedding distribution. Common tokens (e.g., stopwords) dominateκ\kappa-means’ loss during training, consuming the majority of centroids, while rare domain-specific tokens—often more discriminative for retrieval—receive minimal representation. We introduceTachiom, a multivector retrieval system that exploits this structure at both the clustering and retrieval stages. At its core,Tachiomemploys Token-Aware Clustering (Tac), which explicitly allocates centroids based on token frequency and semantic variance. By partitioning the clustering workload across token groups,Tacachieves significantly faster centroids computation over standardκ\kappa-means. This efficiency enables scaling to millions of centroids, far beyond prior work. Leveraging these high-resolution centroids,Tachiomperforms gathering using only centroid-level interactions via anHnswgraph index(Malkov and Yashunin,2020), so as to bypass expensive token-level computations entirely. For refinement, a cache-optimized PQ layout tailored for late interaction enables efficient final scoring.

In summary, the novel contributions of this work are:

•We proposeTac, a clustering algorithm that leverages token identity to decompose the global clustering problem into independent per-token subproblems, improving scalability while producing centroids better aligned with retrieval objectives. We further derive a theoretical lower bound on the computational speed-up achieved byTacoverκ\kappa-means.
•We introduceTachiom, a retrieval architecture that leverages the clustering granularity enabled byTacto perform centroids-only gathering via graph-based search and efficient full scoring with late interaction-tailored PQ.
•We conduct comprehensive experiments onMs Marco-v1andLoTTE, demonstrating that our proposed architectures offer superior efficiency-effectiveness trade-offs. Our results show thatTacachieves up to247×247\timesfaster training than standardκ\kappa-means, whileTachiomdelivers up to9.8×9.8\timesfaster end-to-end retrieval compared to state-of-the-art systems. To favor reproducibility, we release our implementation ofTacandTachiomin Rust on GitHub athttps://github.com/TusKANNy/tachiom.

2.Methodology

2.1.Token-Aware Clustering

Multivector models such asColBERT(Khattab and Zaharia,2020)represent each document as a set of contextualized token embeddings. State-of-the-art retrieval methods(Santhanamet al.,2022a,b; Nardiniet al.,2024; Scheereret al.,2025; Bianet al.,2025; Fanget al.,2022)rely on centroids to approximate these large collections of token vectors, typically usingκ\kappa-means clustering to select representative centroids. However, standardκ\kappa-means treats token embeddings as a generic vector set, discarding the structural information provided by token identity. Token-Aware Clustering (Tac) explicitly exploits the token structure of multivector representations.Tacserves two purposes: (i) it significantly accelerates clustering by partitioning the workload into independent per-token subproblems, crucial for the massive vector collections produced byColBERT-like models; (ii) it balances centroid allocation toward rare tokens, which are underrepresented by standardκ\kappa-means despite their importance for retrieval accuracy.

κ\kappa-means aims to find a set of clusters𝒞={C1,…,Cκ}\mathcal{C}=\{C_{1},\dots,C_{\kappa}\}that partitions the dataset and minimizes the Within-Cluster Sum of Squares (WCSS) or Inertia, i.e.:

∑i=1κ∑𝐭∈Ci‖𝐭−𝐜i‖2,\sum_{i=1}^{\kappa}\sum_{\mathbf{t}\in C_{i}}\|\mathbf{t}-\mathbf{c}_{i}\|^{2}\ ,where𝐜i\mathbf{c}_{i}denotes the centroid of clusterCiC_{i}. This objective treats all vectors equally, but multivector collections exhibit severe frequency imbalance. Common tokens (e.g., stopwords) may appear millions of times, while domain-specific terms have way lower frequencies (hundreds or few thousands of occurrences). Analysing twoColBERTv2multivector collections, we observed that the top 100 most frequent tokens account for over 40% of all vectors, specifically 41% onMs Marco-v1and 45% onLoTTE-pooled. Consequently, the WCSS loss is dominated by high-frequency tokens, which consume the majority of centroids despite contributing little to retrieval quality. Meanwhile, rare tokens—which often carry the most discriminative semantic signal(Sparck Jones,1988)—receive minimal representation, degrading approximation quality precisely where it matters most.

Drawing from classic Information Retrieval literature, whereInverse Document Frequency(IDF) has long been employed to favor rare, discriminative terms(Robertson,2004; Saltonet al.,1975), we proposeTacto address this imbalance in the clustering context.Tacdecouples centroid allocation from uniform WCSS minimization by explicitly weighting tokens based on their frequency and semantic variance, so as to favor rare tokens.Tacfirst computes a weight for each token type representing its share of the total centroid budget, then performs independentκ\kappa-means clustering for each token type.

Tacdistributes the centroid budget through a four-stage pipeline: (1)Tail Handlingisolates very rare tokens to prevent their complete marginalization; (2)Damped Scoringcomputes an allocation weight for the remaining tokens; (3)Boundingenforces strict limits to avoid both under- and over-allocation; and (4)Budget Reconciliationadjusts the final assignments to exactly match the global centroid budget.

Phase 1: Tail Handling. We partition token types into three categories based on their frequencynn:Micro tokens(n<μn<\mu) receive 1 centroid each,Small tokens(μ≤n<τ\mu\leq n<\tau) receive 2 centroids each, andActive tokens(n≥τn\geq\tau) are allocated centroids dynamically based on their frequency and semantic variance. This fixed allocation prevents the complete marginalization of extremely rare tokens.

Phase 2: Damped Scoring. For active tokens, we compute aspread measuresjs_{j}that quantifies semantic variance; we adopt component-wise variance for computational efficiency, i.e.,

sj≔1nj∑i=1nj‖𝐭j,i−𝐭¯j‖2,s_{j}\coloneqq\frac{1}{n_{j}}\sum_{i=1}^{n_{j}}\|\mathbf{t}_{j,i}-\bar{\mathbf{t}}_{j}\|^{2}\ ,wherenjn_{j}is the frequency of tokenjj,𝐭j,i\mathbf{t}_{j,i}is theii-th occurrence of tokenjj, and𝐭¯j\bar{\mathbf{t}}_{j}is its mean embedding. We then compute adamped weight,wj=nj⋅sj\text{w}_{j}=\sqrt{n_{j}}\cdot s_{j}. The square-root damping applies diminishing returns to highly frequent tokens, preventing them from monopolizing centroids budget; the spread coefficientsjs_{j}favors tokens with high semantic variance. The number of centroidsκj\kappa_{j}of tokenjjis allocated proportionally to its weight,

κj=⌊wj∑i=1NTwi⋅B⌋,\kappa_{j}=\left\lfloor\frac{w_{j}}{\sum_{i=1}^{N_{T}}w_{i}}\cdot B\right\rfloor\ ,whereNTN_{T}is the number of different tokens in our collection andBBis the remaining centroid budget after tail handling.

Phase 3: Bounding. We enforce a hard floorε\varepsilonsuch thatκj≥ε\kappa_{j}\geq\varepsilonfor all active tokens to guarantee minimum representation. Additionally, we impose an upper bound to prevent over-allocation: each token must satisfynjκj≥θ\frac{n_{j}}{\kappa_{j}}\geq\theta, ensuring at leastθ\thetavectors per centroid on average. This dual bounding prevents both under-representation of rare tokens in low-budget regimes and over-allocation to already well-approximated tokens in high-budget regimes.

Phase 4: Budget Reconciliation. We redistribute surplus or deficit centroids to exactly match the target budgetκ\kappa. Once centroids are allocated, we perform independentκj\kappa_{j}-means clustering for each tokenjjwith its assignedκj\kappa_{j}centroids.

In summary,Tacfundamentally differs from standardκ\kappa-means in two key aspects: 1) it prioritizes retrieval quality over pure reconstruction error: by allocating centroids based on token frequency and semantic variance rather than uniform WCSS contribution,Tacensures that rare, discriminative tokens receive adequate representation in terms of assigned centroids; 2) instead of optimizing a single global objective over allNNvectors withκ\kappacentroids,Tacsolves independent subproblems for each token, clustering thenjn_{j}vectors of tokenjjintoκj\kappa_{j}centroids. This decomposition substantially reduces the per-iteration computational cost and enables effective parallelization, as each token can be processed independently, whereas standardκ\kappa-means requires shared access to the full dataset. Moreover, the damped centroid allocation mitigates the bottleneck caused by highly frequent tokens, further improving both total cost and parallel efficiency.

Theoretical Lower Bound for Speed-up. Standardκ\kappa-means requiresO(I⋅N⋅κ⋅d)O(I\cdot N\cdot\kappa\cdot d)operations forIIiterations overNNvectors withκ\kappacentroids indddimensions.Tacreduces this toO(I⋅∑j=1NTnj⋅κj⋅d)O(I\cdot\sum_{j=1}^{N_{T}}n_{j}\cdot\kappa_{j}\cdot d).

We can provide a minimum theoretical speedup ofTacoverκ\kappa-means. Ignoring floor operations and tail handling (Phase 1) for clarity, we have a total centroids budgetκ\kappaand the resulting number of centroids assigned to tokenjjbecomesκj=wj∑i=1NTwi⋅κ\kappa_{j}=\frac{w_{j}}{\sum_{i=1}^{N_{T}}w_{i}}\cdot\kappa. Thus, the speedup achieved byTacoverκ\kappa-means is:

κN∑j=1NTκjnj\displaystyle\frac{\kappa N}{\sum\limits_{j=1}^{N_{T}}\kappa_{j}n_{j}}\=κN∑j=1NT(wj/∑i=1NTwi)κnj=κ(N∑j=1NTwj)κ(∑j=1NTwjnj)\displaystyle=\ \frac{\kappa N}{\sum\limits_{j=1}^{N_{T}}(w_{j}/\sum\limits_{i=1}^{N_{T}}w_{i})\kappa n_{j}}\ =\ \frac{\kappa\left(N\sum\limits_{j=1}^{N_{T}}w_{j}\right)}{\kappa\left(\sum_{j=1}^{N_{T}}w_{j}n_{j}\right)}≥N∑j=1NTwjmaxj=1NT⁡wj∑j=1NTnj=N(∑j=1NTwj)(maxj=1NT⁡wj)N=∑j=1NTwjmaxj=1NT⁡wj.\displaystyle\geq\ \frac{N\sum\limits_{j=1}^{N_{T}}w_{j}}{\max\limits_{j=1}^{N_{T}}w_{j}\sum\limits_{j=1}^{N_{T}}n_{j}}\ =\ \frac{N\left(\,\sum\limits_{j=1}^{N_{T}}w_{j}\right)}{\left(\max\limits_{j=1}^{N_{T}}w_{j}\right)N}\ =\ \frac{\sum\limits_{j=1}^{N_{T}}w_{j}}{\max\limits_{j=1}^{N_{T}}w_{j}}\ . That is, the speedup is lower-bounded by the ratio of the total weight to the maximum single token weight. Because the cumulative weight across the vocabulary vastly exceeds that of any single token, this formulation guarantees a substantial baseline speedup. Furthermore, this bound reveals a key insight: our damped scoring mechanism not only balances centroid allocation toward rare tokens, but actively suppresses the dominance of the most common ones. By applying square-root scaling to frequencies,Tacdirectly reduces the maximum single-token weight, strictly improving the speedup guarantee and preventing these dominant tokens from creating a computational bottleneck.

Residuals Compression. After computing the centroids, we compress the residuals, i.e., the difference between each token vector in the dataset and its assigned centroid, using PQ(Jégouet al.,2011). Because residual magnitudes may vary across tokens due to the non-uniform centroid allocation ofTac, we normalize the residual vectors and store their norms separately. The resulting normalized residuals are more homogeneous and therefore easier to compress effectively.

2.2.Hierarchical Index for Optimized Multivector Retrieval

The speedups achieved byTacoverκ\kappa-means allow us to significantly scale up the number of centroids used. This fine-grained clustering significantly improves the approximation provided by centroids alone, thus eliminating the need for full token scoring (centroid + residual) during the gather phase, differently from existing approaches(Nardiniet al.,2024; Bianet al.,2025; Scheereret al.,2025). Furthermore, by decoupling the gather phase from the residual scores calculation, we can specialize the refinement stage with a Product Quantization layout optimized specifically for document-levelMaxSimMaxSimevaluation. Building on these observations, we introduceTachiom(Token-Aware Clustering and Hierarchical Indexing for Optimized Multivector retrieval), a retrieval architecture that combines anHnswproximity graph over centroids for efficient gathering with a cache-optimized PQ layout for efficient refinement.Tachiomoperates in two main phases: first, it explores centroids to select a set of candidate documents; in the second phase, it uses both centroids and residuals to compute the fullMaxSimMaxSimscores of the candidate documents and provide the final ranking.

Index Structure. We build anHnsw(Malkov and Yashunin,2020)graph over the centroid set; each centroid𝐜i\mathbf{c}_{i}maintains an inverted listℒi\mathcal{L}_{i}containing the document IDs of all tokens assigned to𝐜i\mathbf{c}_{i}. This design enables document-level scoring during the gather phase without accessing individual token vectors.

Gather Phase: Centroid-based Document Scoring. For each query token𝐪i\mathbf{q}_{i}of a queryqq:

(1)we traverse the graph to identify the top-κc\kappa_{c}nearest centroids;
(2)for each retrieved centroid𝐜j\mathbf{c}_{j}, we iterate over its inverted listℒj\mathcal{L}_{j};
(3)for documents appearing in multiple lists, we retain only the highest centroid similarity:s~i(d)=max{j:d∈ℒj}⁡⟨𝐪i,𝐜j⟩\tilde{s}_{i}(d)=\max_{\{j\,:\,d\in\mathcal{L}_{j}\}}\langle\mathbf{q}_{i},\mathbf{c}_{j}\rangle;
(4)we accumulate partial scores across all query tokens:S~(q,d)=∑i=1nqs~i(d)\tilde{S}(q,d)=\sum_{i=1}^{n_{q}}\tilde{s}_{i}(d).

This produces an approximateMaxSimMaxSimscore using only centroid-level interactions. Documents appearing in inverted lists for multiple query tokens naturally accumulate higher scores, capturing the multi-token matching signal without decompressing token vectors. Before the next phase, we truncate the candidate set to the topκd\kappa_{d}documents by partial scoreS~(q,⋅,)\tilde{S}(q,\cdot,). We also apply adaptive filtering to further reduce the candidate set, adopting theCandidates Pruning(CP) strategy proposed in(Martinicoet al.,2026), regulated by a parameterα\alpha.

Refer to caption Figure 1.Distance tables layout with 4 centroids per subspace. Evaluating codej=2j=2for subspacem=2m=2, document tokenii.Refine Phase: PQ-Optimized𝐌𝐚𝐱𝐒𝐢𝐦\mathbf{MaxSim}Computation. For the pruned candidate set, we compute fullMaxSimMaxSimscores using both centroids and PQ-compressed residuals. We adopt specialized data layouts optimized for cache locality and SIMD vectorization.

Document Layout: For each documentdd, we store all tokens’ centroid IDs contiguously, followed by all PQ codes:[𝐜1d,…,𝐜ndd∣PQ11,…,PQndM][\mathbf{c}_{1}^{d},\ldots,\mathbf{c}_{n_{d}}^{d}\mid\text{PQ}_{1}^{1},\ldots,\text{PQ}_{n_{d}}^{M}], wherendn_{d}is the number of tokens in documentddandMMis the number of PQ subspaces. This enables streaming evaluation: we first accumulate all centroid contributions in one pass, then refine with residuals in a second pass.

Distance Table Layout: PQ requires precomputing distance tables between query tokens and PQ centroids for each PQ subspace. A naive multivector PQ implementation processes each query token independently, resulting in a separate distance table per token. However, because a document token’s quantization is fixed, allnqn_{q}query tokens must evaluate their distance against thesamePQ centroid for any given subspace. Consequently, the standard memory layout necessitatesnqn_{q}scattered memory accesses acrossnqn_{q}disjoint tables, creating a severe memory-bandwidth bottleneck. To eliminate this overhead and explicitly exploit the shared access pattern inherent to the MaxSim operator, we reorganize the precomputed distances into a cache-optimized three-level hierarchy:

•Macro-blocks: indexed by PQ subspace (MMblocks forMMsubspaces);
•Centroid blocks: indexed by PQ centroid ID (2b2^{b}blocks forbb-bit codes);
•Query token micro-blocks: containingnqn_{q}consecutive distances between each query token and the same PQ centroid on that subspace.

When evaluating a document token’s PQ codejjfor subspacemm, our layout enables retrieving the distances to allnqn_{q}query tokens as a single contiguous micro-block (Figure1). This avoids thenqn_{q}scattered memory lookups of the standard layout, yielding up to a3.8×3.8\timesspeedup in residual distance computation compared to a standard PQ implementation.

3.Experiments

We perform experiments onMs Marco-v1(Nguyenet al.,2016)(8.8M passages, 598M token vectors, 6,980 dev.small queries) for in-domain evaluation andLoTTE-pooled(Santhanamet al.,2022b)(2.4M passages, 266M token vectors, 2,931 search/dev queries) for out-of-domain evaluation. We useColBERTv2(Santhanamet al.,2022b)as the encoder with a maximum of 180 tokens per document forLoTTE. We evaluate the retrieval results using the standard metrics for each dataset: MRR@10 forMs Marco-v1and Success@5 forLoTTE. Experiments run on an Intel Xeon Silver 4314 CPU @ 2.40GHz with 64 threads. Our implementation is written in Rust1.92.0-nightlyon top of thekANNolo(Delfinoet al.,2025)library. Clustering exploits all 64 threads in parallel, while retrieval experiments are executed sequentially on a single core.

3.1.Clustering Evaluation

We first evaluateTac’s efficiency and effectiveness onMs Marco-v1against twoκ\kappa-means baselines:Faiss(Douzeet al.,2024), which is one of the most widely-adoptedκ\kappa-means implementations, andFastKMeans-rs, an efficient Rust implementation of theFastKMeanslibrary(Clavié and Warner,2025). Since our implementation useskANNolo’s AVX2-optimized single-vector distance kernels without production-ready external libraries such as Intel MKL(Intel Corporation,2024), we evaluate twoFaissconfigurations: generic AVX2 (comparable to our setup) and MKL-enabled.

ForTac, we set tail handling thresholdsμ=128\mu=128andτ=256\tau=256, and bounding parametersε=4\varepsilon=4andθ=39\theta=39based on preliminary experiments. We compress residuals using PQ withM=32M=32subspaces and 8-bit codes per subspace. The number ofκ\kappa-means iterations is set to1010for both standardκ\kappa-means andTac.

Clustering time measurements include (i) centroids computation and (ii) centroid assignments for all vectors in the dataset. The sample size forκ\kappa-means is set to 10M vectors; we assume a 24-hour budget for algorithms’ running time. We assess clustering quality by measuring retrieval effectiveness when using the compressed representations to rerank candidates. In order to isolate vector approximation from index approximation, we do not rely onTachiomfor retrieval. Following the methodology in(Martinicoet al.,2026), we rerank the top 1,000 documents retrieved bySpladeCoCondenser(Formalet al.,2022)using the RerankIndex with all optimizations disabled—specifically removing early exit and pruning mechanisms—and return the top 10 results for each query. This approach is designed to emulate an exhaustive search while maintaining feasible query times, allowing us to precisely assess our centroids’ approximation quality.

The results of this evaluation are reported in Figure2. As shown in the bottom plot,Tacachieves training times orders of magnitude lower than all competing methods, even when they benefit from Intel MKL acceleration. Specifically,Tacyields speedups up to84×84\times(Faisswith MKL),247×247\times(Faiss), and230×230\times(FastKMeans-rs) over the competitors. Our method clusters nearly 600 million 128-dimensional vectors into 262K clusters in just 8 minutes, potentially benefiting prior methods such asPlaid,Emvb,Warp, andIgp, which all rely on this same centroid budget. Furthermore,Tacscales seamlessly to over 4M centroids in only 102 minutes, while the 24-hour timeout prevents the competitors to surpass131131K centroids (FaissandFastKMeans-rs) and524524K (Faisswith MKL).

The top plot of Figure2compares the MRR@10 achieved byTacandκ\kappa-means onMs Marco-v1across different centroid budgets (both with normalized residuals for a fair comparison). We report the MRR@10 achieved by exhaustive search overColBERTv2as reference(MacAvaney and Tonellotto,2024). This plot highlights thatTacnot only offers substantial speedups but also delivers equal or superior approximation quality at a fixed centroid budget, confirming that our allocation strategy effectively improves centroid assignment and validating the benefits of token-aware design for retrieval effectiveness.

Refer to caption Figure 2.Ms Marco-v1. Top plot: MRR@10 for 1,000 candidates reranking. Bottom plot: clustering time in minutes.

3.2.Retrieval Evaluation

We evaluate the efficiency-effectiveness trade-off offered byTachiomby comparing it to three state-of-the-art(Martinicoet al.,2026)multivector retrieval methods:Emvb(Nardiniet al.,2024),Warp(Scheereret al.,2025)andIgp(Bianet al.,2025). We report the best (lowest) average query time forTachiomand all selected competitors at different metrics cut-off points.

ForTachiom, we set the number of centroids to221≈22^{21}\approx 2M forLoTTEand222≈42^{22}\approx 4M forMs Marco-v1. TheHnswgraphs over centroids are built withM=32M=32neighbors andefc=1500ef_{c}=1500. At search time, we perform grid search over: number of retrieved centroids per query tokenκc∈{15,20,40,80,100,120}\kappa_{c}\in\{15,20,40,80,100,120\}, maximum candidate documentsκd∈{250,500,1000,2000,4000}\kappa_{d}\in\{250,500,1000,2000,4000\}, and candidate pruning thresholdα∈{0.35,0.4,0.45,0.5}\alpha\in\{0.35,0.4,0.45,0.5\}, withHnswsearch parameterefs=32κcef_{s}=\frac{3}{2}\kappa_{c}. For competitors, we follow the grid search procedures from(Martinicoet al.,2026). All the methods use the same residuals size, i.e.3232bytes per token vector. This is achieved with PQ32 or its variants forTachiomandEmvb, and with22bits per component for the Scalar Quantization employed byWarpandIgp.

Table 1.Average query time (ms). “−-” indicates the technique does not reach the effectiveness level.Ms Marco-v1(MRR@10)LoTTE(Success@5)39.039.339.667.568.068.5Warp989.8×\times987.0×\times–494.4×\times––Igp727.2×\times––484.3×\times––Emvb555.5×\times564.0×\times573.8×\times544.9×\times553.9×\times592.5×\timesTachiom101415111423Our results are reported in Table1.Tachiomexhibits superior efficiency across all metric cutoffs, achieving speedups ranging from2.5×2.5\timesto9.8×9.8\timescompared to the baselines.Tachiomis up to5.5×5.5\timesfaster than the best available data structureEmvband matches its peak effectiveness despite employing standard Product Quantization (PQ).Emvb, conversely, relies on computationally expensive PQ variants, namely JMPQ(Fanget al.,2022)—a supervised method that jointly trains centroids, PQ, and the query encoder—onMs Marco-v1and OPQ(Geet al.,2014)onLoTTE. These results demonstrate thatTachiom, through its careful centroid selection and ability to scale centroids viaTac, rivals the effectiveness of both JMPQ and OPQ without incurring their additional training overhead.

4.Conclusions and Future Work

In this work, we presentedTac, a clustering algorithm that decomposes the global clustering problem into independent per-token subproblems and allocates centroids to tokens based on their frequency and semantic variance, achieving up to247×247\timesspeedup overkk-means and improving effectiveness. On top of this, we developedTachiom, a retrieval architecture that leverages high-granularity centroids to efficiently gather candidates, bypassing expensive token-level interactions.Tachiomachieves a9.8×9.8\timesspeedup over state-of-the-art methods. Together, these contributions unlock the practical utility of multivector retrieval for large-scale datasets. Future work will assess the generalizability ofTacbeyondColBERTv2and evaluateTachiomon datasets substantially larger thanMs Marco-v1. Additionally, by leveraging the superior approximation quality ofTaccentroids, we intend to investigate higher residual compression ratios.

Acknowledgements

This research was conducted as part of the ENRESFOOD project, funded under the FutureFoodS partnership. The project is co-funded by the Austrian Science Fund (FWF; 10.55776/KIN4661525), The Dutch Ministry of Agriculture, Fisheries, Food Security and Nature (BO-43-219-026), the Italian Ministry of University and Research (MUR), the German Federal Ministry of Research, Technology and Space (031B1717), and the European Union’s Horizon Europe research and innovation programme via FutureFoodS, the European Partnership for a Sustainable Future of Food Systems, co-funded under Grant Agreement No. 101136361.

References

Z. Bian, M. L. Yiu, and B. Tang (2025)IGP: efficient multi-vector retrieval via proximity graph index.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp. 2524–2533.External Links:DocumentCited by:§1,§1,§2.1,§2.2,§3.2.
B. Clavié and B. Warner (2025)Fastkmeans: accelerated kmeans clustering in pytorch and triton.Note:https://github.com/AnswerDotAI/fastkmeans/Cited by:§3.1.
L. Delfino, D. Erriquez, S. Martinico, F. M. Nardini, C. Rulli, and R. Venturini (2025)KANNolo: sweet and smooth approximate k-nearest neighbors search.InProceedings of the 47th European Conference on Information Retrieval,pp. 400–406.Cited by:§3.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019)BERT: pre-training of deep bidirectional transformers for language understanding.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp. 4171–4186.External Links:DocumentCited by:§1.
M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou (2024)The FAISS library.arXiv preprint arXiv:2401.08281.Cited by:§3.1.
J. Engels, B. Coleman, V. Lakshman, and A. Shrivastava (2023)DESSERT: an efficient algorithm for vector set search with vector set queries.InProceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS),Cited by:§1.
Y. Fang, J. Zhan, Y. Liu, J. Mao, M. Zhang, and S. Ma (2022)Joint optimization of multi-vector representation with product quantization.InProceedings of the 11th CCF International Conference Natural Language Processing and Chinese Computing,pp. 631–642.External Links:ISBN 978-3-031-17119-2,DocumentCited by:§1,§2.1,§3.2.
T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant (2021)SPLADE v2: sparse lexical and expansion model for information retrieval.arXiv preprint arXiv:2109.10086.Cited by:§1.
T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant (2022)From distillation to hard negative sampling: making sparse neural ir models more effective.InProceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp. 2353–2359.External Links:DocumentCited by:§3.1.
T. Ge, K. He, Q. Ke, and J. Sun (2014)Optimized product quantization.IEEE Trans. Pattern Anal. Mach. Intell.36(4),pp. 744–755.External Links:DocumentCited by:§1,§3.2.
Intel Corporation (2024)Intelő oneapi math kernel library (onemkl).External Links:LinkCited by:§3.1.
H. Jégou, M. Douze, and C. Schmid (2011)Product quantization for nearest neighbor search.IEEE transactions on pattern analysis and machine intelligence33,pp. 117–28.External Links:DocumentCited by:§1,§2.1.
V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020)Dense passage retrieval for open-domain question answering.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP),pp. 6769–6781.External Links:DocumentCited by:§1.
O. Khattab and M. Zaharia (2020)ColBERT: efficient and effective passage search via contextualized late interaction over bert.InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval,pp. 39–48.External Links:ISBN 9781450380164,DocumentCited by:§1,§2.1.
C. Lassance, H. Déjean, T. Formal, and S. Clinchant (2024)SPLADE-v3: new baselines for SPLADE.arXiv preprint arXiv:2403.06789.Cited by:§1.
J. Lee, Z. Dai, S. M. K. Duddu, T. Lei, I. Naim, M. Chang, and V. Y. Zhao (2023)Rethinking the role of token retrieval in multi-vector retrieval.InProceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS),Cited by:§1.
S. MacAvaney and N. Tonellotto (2024)A reproducibility study of PLAID.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp. 1411–1419.External Links:DocumentCited by:§3.1.
Y. A. Malkov and D. A. Yashunin (2020)Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs.IEEE Trans. Pattern Anal. Mach. Intell.42(4),pp. 824–836.External Links:ISSN 0162-8828,DocumentCited by:§1,§2.2.
S. Martinico, F. M. Nardini, C. Rulli, and R. Venturini (2026)Multivector reranking in the era of strong first-stage retrievers.arXiv preprint arXiv:2601.05200.External Links:2601.05200,LinkCited by:§2.2,§3.1,§3.2,§3.2.
F. M. Nardini, C. Rulli, and R. Venturini (2024)Efficient multi-vector dense retrieval with bit vectors.InProceedings of the 46th European Conference on Information Retrieval (ECIR),pp. 3–17.External Links:DocumentCited by:§1,§1,§2.1,§2.2,§3.2.
T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016)MS MARCO: A human generated machine reading comprehension dataset.InProceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches,Vol.1773.External Links:LinkCited by:§3.
S. E. Robertson (2004)Understanding inverse document frequency: on theoretical arguments for idf.J. Documentation60,pp. 503–520.External Links:LinkCited by:§2.1.
G. Salton, A. Wong, and C. S. Yang (1975)A vector space model for automatic indexing.Commun. ACM18(11),pp. 613–620.External Links:ISSN 0001-0782,Link,DocumentCited by:§2.1.
K. Santhanam, O. Khattab, C. Potts, and M. Zaharia (2022a)PLAID: an efficient engine for late interaction retrieval.InProceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM),pp. 1747–1756.External Links:DocumentCited by:§1,§1,§2.1.
K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia (2022b)ColBERTv2: effective and efficient retrieval via lightweight late interaction.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp. 3715–3734.External Links:DocumentCited by:§1,§1,§2.1,§3.
J. L. Scheerer, M. Zaharia, C. Potts, G. Alonso, and O. Khattab (2025)WARP: an efficient engine for multi-vector retrieval.InProceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp. 2504–2512.External Links:DocumentCited by:§1,§1,§2.1,§2.2,§3.2.
K. Sparck Jones (1988)A statistical interpretation of term specificity and its application in retrieval.InDocument Retrieval Systems,pp. 132–142.External Links:ISBN 0947568212Cited by:§2.1.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017)Attention is all you need.InProceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS),pp. 6000–6010.Cited by:§1.

Antoine Chaffin (@antoine_chaffin): Whether you are GPU poor or GPU rich, today’s release of PyLate has something for you! GPU maxxers: MaxSim kernels greatly speed up training while lowering the memory requirements CPU enjoyers: TACHIOM enables lightning fast multi-vector indexing and search directly on CPU

@cosimorulli1: Happy to share that our recent work, TACHIOM, got integrated into the PyLate ecosystem! https://arxiv.org/pdf/2604.2814…

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Abstract.

1.Introduction

2.Methodology

2.1.Token-Aware Clustering

2.2.Hierarchical Index for Optimized Multivector Retrieval

3.Experiments

3.1.Clustering Evaluation

3.2.Retrieval Evaluation

4.Conclusions and Future Work

Acknowledgements

References

Similar Articles

@jino_rohit: over the last 6-8 months, ive been trying to move towards the ml systems and ai infra space. these are some of my favor…

@antoine_chaffin: Whether you are GPU poor or GPU rich, today's release of PyLate has something for you! GPU maxxers: MaxSim kernels grea…

@topk_io: https://x.com/topk_io/status/2065172828161200563

@SilvioMartinico: The late-interaction multivector retrieval ecosystem is exploding right now. To help separate the signal from the noise…

@antoine_chaffin: The new generation of open state-of-the-art single and multi-vector retrieval models is here It's time, DenseOn with th…

Submit Feedback

Similar Articles

@jino_rohit: over the last 6-8 months, ive been trying to move towards the ml systems and ai infra space. these are some of my favor…

@antoine_chaffin: Whether you are GPU poor or GPU rich, today's release of PyLate has something for you! GPU maxxers: MaxSim kernels grea…

@topk_io: https://x.com/topk_io/status/2065172828161200563

@SilvioMartinico: The late-interaction multivector retrieval ecosystem is exploding right now. To help separate the signal from the noise…

@antoine_chaffin: The new generation of open state-of-the-art single and multi-vector retrieval models is here It's time, DenseOn with th…