Why Advanced Encoders Lag on Sparse Retrieval? The Answer and an Approach to Bridging Vocabulary Gaps


Summary

This paper identifies a vocabulary gap as the root cause of why advanced encoders like ModernBERT underperform in learned sparse retrieval, and proposes Vocabulary Transfer (VT), a model-agnostic framework that migrates encoders to sparse-friendly vocabularies, achieving state-of-the-art results on the BEIR benchmark.

arXiv:2607.00004v1 Announce Type: cross

# Why Advanced Encoders Lag on Sparse Retrieval? The Answer and an Approach to Bridging Vocabulary Gaps
Source: [https://arxiv.org/html/2607.00004](https://arxiv.org/html/2607.00004)
\(2026\)

###### Abstract.

While advanced foundation models like ModernBERT significantly outperform older architectures in dense retrieval, they surprisingly lag behind the aging BERT-base baseline in learned sparse retrieval (LSR). We identify the root cause as the Vocabulary Gap: modern tokenizers utilize raw, case-sensitive vocabularies designed for lossless reconstruction, which map single semantic units to redundant surface forms, wasting model capacity on morphological noise and hindering lexical matching. We formalize this intuition through a theoretical framework, demonstrating that appropriate vocabulary coarse-graining can tighten the generalization bounds by reducing the complexity of the hypothesis class, provided that semantic integrity is preserved. To resolve this, we propose Vocabulary Transfer (VT), a model-agnostic framework that migrates advanced encoders to sparse-friendly, normalized vocabularies with minimal computational cost. VT utilizes a novel Semantic Initialization via spatial topology to preserve geometric structure and an Activation Potential Calibration (APC) mechanism to align pre-trained manifolds with sparsity constraints, preventing the dead neuron and dense collapse observed in standard fine-tuning. Empirically, VT is universally effective: it enables ModernBERT to achieve state-of-the-art performance on the BEIR benchmark (52.4 nDCG, a +4.7 improvement), resuscitates failing models like RoBERTa-large, and generalizes seamlessly to inference-free architectures and specialized domains. These results confirm that the performance lag is not an architectural deficiency but a solvable vocabulary mismatch. We've released our code and models at [https://anonymous.4open.science/r/vocab-transfer/](https://anonymous.4open.science/r/vocab-transfer/) (all details included).

SPLADE, learned sparse representations, passage retrieval

Journal year: 2026. Copyright: CC. Conference: Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26), July 20–24, 2026, Melbourne, VIC, Australia. DOI: 10.1145/3805712.3809724. ISBN: 979-8-4007-2599-9/2026/07. CCS: Information systems → Retrieval models and ranking.

## 1. Introduction

The landscape of neural information retrieval has bifurcated into two dominant paradigms: dense retrieval, which encodes queries and documents into continuous low-dimensional embeddings (Karpukhin et al., [2020a](https://arxiv.org/html/2607.00004#bib.bib25); Xiong et al., [2021](https://arxiv.org/html/2607.00004#bib.bib26)), and learned sparse retrieval (LSR), which projects text into high-dimensional, weighted lexical vectors (Formal et al., [2021b](https://arxiv.org/html/2607.00004#bib.bib8); Mallia et al., [2021](https://arxiv.org/html/2607.00004#bib.bib27)). While dense retrievers excel at capturing semantic nuances, sparse retrievers, exemplified by models like SPLADE (Formal et al., [2021b](https://arxiv.org/html/2607.00004#bib.bib8)), retain the interpretability and efficiency of inverted indices while mitigating the lexical mismatch problem of traditional BM25 (Robertson et al., [1995](https://arxiv.org/html/2607.00004#bib.bib31); Manning et al., [2008](https://arxiv.org/html/2607.00004#bib.bib9)).

In the dense retrieval paradigm, upgrading the backbone is a proven strategy. Modern foundation models like ModernBERT (Warner et al., [2025](https://arxiv.org/html/2607.00004#bib.bib30)) provide not only stronger representations but also architectural advantages like 8k context windows and FlashAttention compatibility.

![Refer to caption](https://arxiv.org/html/2607.00004v1/x1.png)

Figure 1. The Vocabulary Gap anomaly. While advanced encoders like ModernBERT significantly outperform BERT in dense retrieval, they lag behind in sparse retrieval under standard fine-tuning.

However, these architectural leaps remain inaccessible to sparse retrieval. We observe a puzzling anomaly: advanced encoders consistently underperform in sparse settings, often lagging behind the older BERT-base-uncased baseline. As illustrated in [Figure 1](https://arxiv.org/html/2607.00004#S1.F1), this performance degradation is pervasive. The most intuitive explanation attributes this to the BPE tokenizer differences in modern models. However, we observe that bert-base-cased, which uses the same WordPiece tokenizer as the effective bert-base-uncased baseline, performs equally poorly. This isolates the degree of vocabulary normalization as the critical variable. This regression persists despite identical training pipelines, suggesting that the architectural advancements of modern backbones are stifled by a fundamental incompatibility with the sparse retrieval objective.

We identify the root cause as the vocabulary gap: specifically, the shift in modern tokenization toward raw vocabularies (i.e., lacking normalization or pre-tokenization) designed for lossless reconstruction. These tokenizers map single semantic units to redundant surface variants (e.g., “Token” vs. “token”), forcing the model to waste capacity bridging these orthogonal dimensions, a burden dense models bypass. While forcing input lowercasing offers partial relief, it is insufficient; aggressive lowercasing on a case-sensitive tokenizer often fragments tokens (e.g., Halloween → hall, ow, een), destroying semantic integrity.
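
The fragmentation effect can be reproduced with a toy example. The sketch below is only an illustration under stated assumptions: `CASED_VOCAB` is a hypothetical six-entry vocabulary and `greedy_tokenize` is a simplified longest-match splitter, not the tokenizer of any real model.

```python
# Hypothetical case-sensitive vocabulary; real vocabularies hold tens of thousands of entries.
CASED_VOCAB = {"Halloween", "Token", "token", "hall", "ow", "een"}

def greedy_tokenize(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append("[UNK]")  # no vocabulary piece starts at this character
            start += 1
    return pieces

intact = greedy_tokenize("Halloween", CASED_VOCAB)       # cased input matches one entry
fragmented = greedy_tokenize("halloween", CASED_VOCAB)   # lowercased input is split up
cased_ids = {piece: i for i, piece in enumerate(sorted(CASED_VOCAB))}
```

Here the cased input stays a single unit while the lowercased input breaks into three unrelated pieces, and `Token` and `token` occupy two distinct dimensions in `cased_ids`, which is the redundancy a sparse retriever would have to bridge.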

Compounding this challenge is the prohibitive cost of remediation. While training a model from scratch with a sparse-friendly vocabulary could theoretically solve the issue, it is computationally impractical. Modern foundation models are trained on massive corpora: ModernBERT, for instance, on 2 trillion tokens (Warner et al., [2025](https://arxiv.org/html/2607.00004#bib.bib30)). Replicating this pre-training scale simply to swap the vocabulary is infeasible for most applications. Consequently, the field faces a dilemma: we require the reasoning power and inference efficiency of modern backbones, yet their native vocabularies are ill-suited for sparse retrieval.

In this work, we provide the answer to this lag and a method to resolve it. We argue that sparse retrieval requires a Representation-Compatible vocabulary, one that normalizes surface forms while preserving semantic distinctions. We formalize this intuition through a theoretical framework showing that appropriate vocabulary coarse-graining improves the generalization bound of sparse retrievers by reducing hypothesis class complexity without sacrificing approximation power.

Guided by this theory, we propose Vocabulary Transfer (VT), a recipe to migrate strong pre-trained backbones to a sparse-friendly vocabulary with minimal cost, using less than 0.2% of the original ModernBERT training tokens and achieving near-optimal performance with just 500 MLM steps. VT utilizes a novel Semantic Initialization via spatial topology and an Activation Potential Calibration mechanism. This aligns the advanced backbone with the sparsity constraints of models like SPLADE, preventing the “dead neuron” and dense collapse observed in standard fine-tuning.

Our contributions are as follows:

- Theoretical Analysis: We derive a generalization bound for sparse retrieval under vocabulary coarse-graining, introducing Representation Compatibility (RC) to explain why normalization improves learnability.
- Methodology: We propose VT, a model-agnostic procedure that transplants regularized vocabularies onto advanced encoders using geometric initialization and discrepancy-aware adaptation.
- Empirical Validation: We demonstrate that VT is universally effective. It enables ModernBERT to achieve state-of-the-art results on BEIR (Thakur et al., [2021](https://arxiv.org/html/2607.00004#bib.bib34)) (52.4 nDCG, a +4.7 improvement), resuscitates failing models like RoBERTa-large, and generalizes seamlessly to inference-free architectures and domain-specific adaptation.

## 2. Related Work

### 2.1. Neural Sparse Retrieval

The evolution of information retrieval has seen a transition from exact matching heuristics, such as BM25 (Robertson et al., [2009](https://arxiv.org/html/2607.00004#bib.bib35)), to neural architectures that learn semantic representations. While dense retrieval (Karpukhin et al., [2020b](https://arxiv.org/html/2607.00004#bib.bib36); Xiong et al., [2020](https://arxiv.org/html/2607.00004#bib.bib37)) encodes queries and documents into continuous low-dimensional spaces, Learned Sparse Retrieval (LSR) projects text into high-dimensional sparse vectors, preserving the interpretability and efficiency of inverted indices.

Early LSR approaches focused on estimating term weights or expanding documents with relevant terms. DeepCT (Dai and Callan, [2020](https://arxiv.org/html/2607.00004#bib.bib39)) utilized BERT to predict context-aware term weights, mapping them back to the bag-of-words space. Similarly, docT5query (Nogueira et al., [2019](https://arxiv.org/html/2607.00004#bib.bib40)) employed generative models to expand documents with potential queries. SparTerm (Bai et al., [2020](https://arxiv.org/html/2607.00004#bib.bib42)) introduced a gating mechanism to explicitly learn term importance and enforce sparsity. COIL (Gao et al., [2021](https://arxiv.org/html/2607.00004#bib.bib78)) bridged the gap between sparse and dense methods by storing efficient contextualized representations in inverted lists.

The SPLADE family (Formal et al., [2021b](https://arxiv.org/html/2607.00004#bib.bib8), [a](https://arxiv.org/html/2607.00004#bib.bib11); Lassance and Clinchant, [2022](https://arxiv.org/html/2607.00004#bib.bib47); Lassance et al., [2024](https://arxiv.org/html/2607.00004#bib.bib46)) represented a paradigm shift by applying sparsity regularization directly on the Masked Language Model (MLM) logits, performing simultaneous expansion and weighting. Recent research has shifted towards inference-free architectures to reduce query-side latency. TILDE (Zhuang and Zuccon, [2021](https://arxiv.org/html/2607.00004#bib.bib79)) and subsequent works (Geng et al., [2024](https://arxiv.org/html/2607.00004#bib.bib7); Shen et al., [2025](https://arxiv.org/html/2607.00004#bib.bib12)) pre-compute document representations while keeping query processing lightweight. However, these models are exposed to the Vocabulary Gap, as they lack the capacity to dynamically bridge lexical mismatches between the pre-trained backbone and the retrieval task.

### 2.2. Pre-trained Backbones and Tokenization

The efficacy of Pre-trained Language Models (PLMs) is inextricably linked to their tokenization strategies. Standard architectures like BERT (Devlin et al., [2019](https://arxiv.org/html/2607.00004#bib.bib28)) utilize WordPiece (Schuster and Nakajima, [2012](https://arxiv.org/html/2607.00004#bib.bib33)), while modern backbones such as RoBERTa (Liu et al., [2019](https://arxiv.org/html/2607.00004#bib.bib29)) and ModernBERT (Warner et al., [2025](https://arxiv.org/html/2607.00004#bib.bib30)) rely on BPE (Sennrich et al., [2016](https://arxiv.org/html/2607.00004#bib.bib32)). While techniques like Subword Regularization (Kudo, [2018](https://arxiv.org/html/2607.00004#bib.bib80)) and CharacterBERT (El Boukkouri et al., [2020](https://arxiv.org/html/2607.00004#bib.bib81)) attempt to improve morphological robustness, the rigid distinctness of surface forms in standard subword vocabularies remains a fundamental bottleneck for sparse matching, necessitating significant model capacity to bridge these lexical gaps.

### 2.3. Vocabulary Transfer and Adaptation

Adapting pre-trained models to new vocabularies is a critical challenge. This problem has been extensively studied in the context of cross-lingual transfer, where vocabulary misalignment severely hampers performance (Artetxe et al., [2020](https://arxiv.org/html/2607.00004#bib.bib82)). To address this, various initialization strategies have been proposed to align new vocabularies with pre-trained manifolds without full retraining. WECHSEL (Minixhofer et al., [2022](https://arxiv.org/html/2607.00004#bib.bib83)) uses a shared bilingual static embedding space to map target subwords and initializes each new token embedding as a similarity-weighted average of its $k$ nearest source subword embeddings. More recently, FOCUS (Dobler and de Melo, [2023](https://arxiv.org/html/2607.00004#bib.bib24)) was proposed for transferring vocabularies between monolingual and multilingual language models. It leverages FastText (Bojanowski et al., [2017](https://arxiv.org/html/2607.00004#bib.bib95)) to derive similarity relations between the new token and anchor tokens, which are then used to weight the combination. Mundra et al. ([2024](https://arxiv.org/html/2607.00004#bib.bib23)) provided a comprehensive empirical validation of these strategies, highlighting that leveraging the source embedding structure is crucial for convergence.

In the specific context of Learned Sparse Retrieval (LSR), the impact of vocabulary design is profound yet has only recently gained attention. Lionis et al. ([2026](https://arxiv.org/html/2607.00004#bib.bib101)) empirically confirmed the effect of vocabulary casing on sparse retrieval, and Lei et al. ([2025](https://arxiv.org/html/2607.00004#bib.bib100)) explored enhancing lexicon-based embeddings with LLMs. Regarding adaptation, ESPLADE (Dudek et al., [2023](https://arxiv.org/html/2607.00004#bib.bib84); Kim et al., [2025](https://arxiv.org/html/2607.00004#bib.bib85)) represents a recent attempt to transfer SPLADE capabilities to new vocabularies. However, ESPLADE relies on computationally expensive continued Masked Language Modeling (MLM) on a large corpus to align the new embedding space. Unlike these approaches, our work proposes a Representation-Compatible (RC) transfer method that utilizes geometric initialization to close the vocabulary gap with minimal adaptation cost.

### 2.4. Theoretical Analysis of Retrieval Models

Theoretical analyses of retrieval models traditionally focus on probabilistic relevance modeling and term-weighting schemes such as BM25 within the probabilistic relevance framework (Robertson et al., [2009](https://arxiv.org/html/2607.00004#bib.bib35); Manning et al., [2008](https://arxiv.org/html/2607.00004#bib.bib9)). For modern neural models, most available theory comes from general learning-theoretic tools rather than IR-specific analyses. In particular, Rademacher-complexity-based bounds for linear predictors with $\ell_1$ and $\ell_2$ constraints provide sharp estimates of sample complexity and margin-based generalization for sparse linear hypothesis classes (Kakade et al., [2008](https://arxiv.org/html/2607.00004#bib.bib13)). These results underpin many later analyses of regularization, sparsity, and high-dimensional learning. For learned sparse retrieval, existing work has focused largely on empirical or architectural aspects. To the best of our knowledge, there is still little work that explicitly connects vocabulary design and normalization to capacity measures or sample complexity in LSR.

## 3. Theoretical Analysis

We give a unified analysis for sparse retrievers that operate in a shared discrete keyspace (tokens/terms) with nonnegative, sparse weights on both sides. To avoid overloading symbols, we reserve $d$ for documents and use $p$ for feature-space dimensionality throughout this section.

### 3.1. Modeling via RC Coarse-Graining

While neural sparse retrievers like SPLADE utilize dot products between two learned encoders, we analyze the generalization capability of the document encoder by treating the query encoder as generating a distribution of linear weights. This standard reduction allows us to apply Rademacher complexity analysis to the sparse representation learning problem.

##### Sparse keyspace.

Let $V$ be a discrete keyspace. For query $q$ and document $d$, let $w_{\theta,q}, w_{\theta,d} \in \mathbb{R}_{\geq 0}^{V}$ be sparse encoder weights. We consider separable per-key features

$$[u_{\theta}(q,d)]_{t} \triangleq \psi(w_{\theta,q}(t))\,\phi(w_{\theta,d}(t)), \qquad u_{\theta}(q,d) \in [0,R]^{|V|},$$

where $\psi, \phi: \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ are non-decreasing. Assume $\ell_1$ budgets (encouraged by sparsity regularization) $\|w_{\theta,q}\|_1 \leq S_q$, $\|w_{\theta,d}\|_1 \leq S_d$, which imply $\|u_{\theta}(q,d)\|_{\infty} \leq R \triangleq \psi(S_q)\,\phi(S_d)$.

##### Coarse-graining.

A normalizer induces a many-to-one map $\pi: V \to V'$. Let $G \in \mathbb{R}_{\geq 0}^{|V'| \times |V|}$ be row-stochastic and $\pi$-respecting: $G_{ut} = 0$ if $t \notin \pi^{-1}(u)$ and $\sum_{t \in \pi^{-1}(u)} G_{ut} = 1$. Define coarse-grained features

$$u'_{\theta}(q,d) \triangleq G\,u_{\theta}(q,d) \in [0,R]^{|V'|}.$$
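
To make the construction concrete, the following sketch builds per-key features and one possible $\pi$-respecting aggregator. It rests on assumptions not fixed by the text: $\psi$ and $\phi$ default to the identity, $\pi$ is plain lowercasing, and $G$ places uniform weight inside each merged group. The helper names are hypothetical.

```python
def pair_features(w_q, w_d, psi=lambda x: x, phi=lambda x: x):
    """Per-key features u_t = psi(w_q[t]) * phi(w_d[t]) over the union of keys."""
    return {t: psi(w_q.get(t, 0.0)) * phi(w_d.get(t, 0.0)) for t in set(w_q) | set(w_d)}

def build_aggregator(keys, normalize):
    """Row-stochastic, pi-respecting G (uniform weights within each group: an assumption)."""
    groups = {}
    for t in keys:
        groups.setdefault(normalize(t), []).append(t)
    return {u: {t: 1.0 / len(ts) for t in ts} for u, ts in groups.items()}

def coarse_grain(G, u):
    """Apply G to the feature vector u, giving one value per coarse key."""
    return {key: sum(g * u.get(t, 0.0) for t, g in row.items()) for key, row in G.items()}

# Toy nonnegative sparse weights for one query and one document.
w_q = {"Token": 0.0, "token": 1.2, "sparse": 0.8}
w_d = {"Token": 0.9, "token": 0.5, "sparse": 0.4}

u = pair_features(w_q, w_d)
G = build_aggregator(u.keys(), str.lower)
u_coarse = coarse_grain(G, u)
```

Each row of `G` sums to one and is supported only on the keys that normalize to the same coarse key, so every coarse feature is a convex combination of raw features and stays inside $[0,R]$.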

##### Hypothesis classes.

With a shared $\ell_1$ budget $B$, define

$$\mathcal{H}_{V} = \{\langle \beta, u_{\theta}(q,d) \rangle :\ \beta \geq 0,\ \|\beta\|_1 \leq B\}, \tag{1}$$

$$\mathcal{H}_{V'} = \{\langle \beta', u'_{\theta}(q,d) \rangle :\ \beta' \geq 0,\ \|\beta'\|_1 \leq B\}. \tag{2}$$

##### Representation-compatibility (RC)

Overly aggressive coarse-graining (e.g., heavy stemming) can conflate meanings, so we focus on normalizers that mostly merge surface variants (e.g., case folding (Manning et al., [2008](https://arxiv.org/html/2607.00004#bib.bib9))). The aggregation $G$ is *RC* if there exists $\varepsilon_{\mathrm{RC}} \geq 0$ such that for every $\beta \geq 0$ with $\|\beta\|_1 \leq B$ there is $\beta' \geq 0$, $\|\beta'\|_1 \leq B$, satisfying

$$\sup_{(q,d)} \big| \langle \beta, u_{\theta}(q,d) \rangle - \langle \beta', u'_{\theta}(q,d) \rangle \big| \leq \varepsilon_{\mathrm{RC}}. \tag{RC$\star$}$$
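
As an illustrative special case (a worked example under our own assumptions, with $v$ indexing coarse keys in $V'$), consider raw weight vectors that distribute a per-group weight $c_v$ across surface variants exactly in proportion to $G$. For such $\beta$, choosing $\beta' = c$ reproduces the score exactly, so these directions contribute nothing to $\varepsilon_{\mathrm{RC}}$:

```latex
\beta_t = c_{\pi(t)}\,G_{\pi(t),t}, \qquad c \in \mathbb{R}_{\geq 0}^{|V'|},\ \|c\|_1 \leq B
\;\Longrightarrow\;
\langle \beta, u_\theta \rangle
  = \sum_{v \in V'} c_v \sum_{t \in \pi^{-1}(v)} G_{vt}\,[u_\theta]_t
  = \sum_{v \in V'} c_v\,[u'_\theta]_v
  = \langle c, u'_\theta \rangle ,
\qquad
\|\beta\|_1 = \sum_{v \in V'} c_v \sum_{t \in \pi^{-1}(v)} G_{vt} = \|c\|_1 .
```

A $\beta$ that weights two merged variants very differently from $G$ can only be matched approximately, and that residual is what $\varepsilon_{\mathrm{RC}}$ is meant to capture.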

### 3.2. Generalization under RC Coarse-Graining

##### Rademacher tools.

Let $p \in \mathbb{N}$, $B, R > 0$, and $\mathcal{G}_p = \{\langle \beta, x \rangle :\ \beta \geq 0,\ \|\beta\|_1 \leq B\}$. For a vector class $\mathcal{U} \subset [0,R]^p$, we use the $\ell_\infty$-type empirical Rademacher complexity

$$\hat{\mathfrak{R}}_n(\mathcal{U}; \|\cdot\|_\infty) \triangleq \mathbb{E}_{\sigma}\!\left[ \sup_{u \in \mathcal{U}} \left\| \frac{1}{n} \sum_{i=1}^{n} \sigma_i\, u(z_i) \right\|_\infty \right],$$

where $z_{1:n}$ is the sample and $\sigma_{1:n}$ are i.i.d. Rademacher signs. Then the following bound holds (Bartlett and Mendelson, [2002](https://arxiv.org/html/2607.00004#bib.bib18); Mohri et al., [2018](https://arxiv.org/html/2607.00004#bib.bib15)):

$$\hat{\mathfrak{R}}_n(\mathcal{G}_p \circ \mathcal{U}) \leq B\, \hat{\mathfrak{R}}_n(\mathcal{U}; \|\cdot\|_\infty). \tag{3}$$

If $G \in \mathbb{R}_{\geq 0}^{p' \times p}$ is row-stochastic, then (Horn and Johnson, [2013](https://arxiv.org/html/2607.00004#bib.bib22))

$$\hat{\mathfrak{R}}_n(G \circ \mathcal{U}; \|\cdot\|_\infty) \leq \hat{\mathfrak{R}}_n(\mathcal{U}; \|\cdot\|_\infty). \tag{4}$$
###### Lemma 3.1 (Row-stochastic aggregation does not increase feature-class complexity).

Consequences for $u'_{\theta} = G\,u_{\theta}$. Let $\mathcal{W}_V = \{u_{\theta}(\cdot) \in [0,R]^{|V|}\}$ and $\mathcal{W}_{V'} = \{u'_{\theta}(\cdot) = G\,u_{\theta}(\cdot) \in [0,R]^{|V'|}\}$, and recall the linear heads $\mathcal{H}_V, \mathcal{H}_{V'}$ from ([1](https://arxiv.org/html/2607.00004#S3.E1))–([2](https://arxiv.org/html/2607.00004#S3.E2)). By ([4](https://arxiv.org/html/2607.00004#S3.E4)),

$$\hat{\mathfrak{R}}_n(\mathcal{W}_{V'}; \|\cdot\|_\infty) \leq \hat{\mathfrak{R}}_n(\mathcal{W}_V; \|\cdot\|_\infty). \tag{5}$$
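
The row-stochastic inequality used here rests on a short argument, sketched below as a reading aid rather than as the appendix proof. A matrix with nonnegative entries whose rows sum to one is non-expansive in the $\ell_\infty$ norm, and aggregation commutes with the Rademacher average by linearity:

```latex
\text{For any } x \in \mathbb{R}^{p} \text{ and each row } j:\quad
\big|(Gx)_j\big| = \Big|\sum_{t} G_{jt}\,x_t\Big|
  \leq \sum_{t} G_{jt}\,|x_t|
  \leq \|x\|_\infty \sum_{t} G_{jt} = \|x\|_\infty ,
\quad\text{so}\quad \|Gx\|_\infty \leq \|x\|_\infty .

\text{Taking } x = \frac{1}{n}\sum_{i=1}^{n}\sigma_i\,u(z_i):\quad
\Big\|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\,G\,u(z_i)\Big\|_\infty
  = \|Gx\|_\infty \leq \|x\|_\infty ,
```

and applying the supremum over $u \in \mathcal{U}$ followed by the expectation over $\sigma$ to both sides preserves the inequality.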

###### Theorem 3.2 (Sample complexity improves under RC coarse-graining).

Let $\ell: \mathbb{R} \to [0,1]$ be $L$-Lipschitz and let $\hat{h}_G \in \mathcal{H}_{V'}$ be an ERM on $n$ samples. Assume RC in ([RC⋆](https://arxiv.org/html/2607.00004#S3.Ex3)), which implies $\inf_{h \in \mathcal{H}_{V'}} \mathcal{L}(h) \leq \inf_{h \in \mathcal{H}_V} \mathcal{L}(h) + L\,\varepsilon_{\mathrm{RC}}$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$,

$$\mathcal{L}(\hat{h}_G) - \inf_{h \in \mathcal{H}_V} \mathcal{L}(h) \leq L\,\varepsilon_{\mathrm{RC}} + 4BL\, \hat{\mathfrak{R}}_n(\mathcal{W}_{V'}; \|\cdot\|_\infty) + 2C_{\mathrm{gen}} \sqrt{\tfrac{\log(1/\delta)}{n}}. \tag{6}$$

Moreover, $|V'| \leq |V|$ and ([5](https://arxiv.org/html/2607.00004#S3.E5)) hold, hence the feature-class complexity term does not increase while the estimation error bound tightens. If, in addition, $\varepsilon_{\mathrm{RC}}$ is sufficiently small, then the overall generalization bound under coarse-graining is tighter.

The proof follows from standard Rademacher symmetrization, Ledoux–Talagrand contraction, and McDiarmid’s inequality; full details are in Appendix [B](https://arxiv.org/html/2607.00004#A2).
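
The complexity term in the bound can be probed numerically. The sketch below is a Monte Carlo estimate over a small finite feature class with synthetic data; the class, the merge matrix `G`, and all sizes are hypothetical choices for illustration, not the paper's experimental setup.

```python
import random

def linf_rademacher(feature_class, n_draws=200, seed=0):
    """Estimate E_sigma sup_u || (1/n) sum_i sigma_i u(z_i) ||_inf.

    `feature_class` is a finite class; each member is a list of n feature vectors.
    """
    rng = random.Random(seed)
    n = len(feature_class[0])
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        best = 0.0
        for vectors in feature_class:
            dim = len(vectors[0])
            avg = [sum(s * v[j] for s, v in zip(sigma, vectors)) / n for j in range(dim)]
            best = max(best, max(abs(a) for a in avg))
        total += best
    return total / n_draws

def apply_rows(G, v):
    """Multiply the row-stochastic matrix G (list of rows) with the vector v."""
    return [sum(g * x for g, x in zip(row, v)) for row in G]

data_rng = random.Random(1)
# Three feature maps, each evaluated on 20 samples with 4 raw keys in [0, 1].
raw_class = [[[data_rng.random() for _ in range(4)] for _ in range(20)] for _ in range(3)]
G = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]  # merges keys pairwise
coarse_class = [[apply_rows(G, v) for v in vectors] for vectors in raw_class]

raw_estimate = linf_rademacher(raw_class)
coarse_estimate = linf_rademacher(coarse_class)
```

Because both estimates reuse the same seeded sign draws, the coarse-grained estimate cannot exceed the raw one beyond floating-point rounding, which mirrors the non-increase stated for the feature-class complexity.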

##### Corollary (pointwise & pairwise).

The theorem extends from pointwise to pairwise training by replacing $u_{\theta}(q,d)$ with the triplet difference $u_{\theta}(q,d^{+}) - u_{\theta}(q,d^{-})$. Under this replacement, the $\ell_\infty$-type vector Rademacher complexity appearing in Theorem [3.2](https://arxiv.org/html/2607.00004#S3.Thmtheorem2) increases by at most a factor of 2. Hence the generalization bound holds for both pointwise and pairwise training, up to this constant factor.

### 3.3. Inference-Free as a Special Case

For inference-free sparse retrievers (Shen et al., [2025](https://arxiv.org/html/2607.00004#bib.bib12); Geng et al., [2024](https://arxiv.org/html/2607.00004#bib.bib7); Formal et al., [2021a](https://arxiv.org/html/2607.00004#bib.bib11)), the query-side weights are fixed incidence vectors determined by the query text (i.e., $w_{\theta,q}$ is replaced by $w_q$), so the preceding analysis applies verbatim. Under RC coarse-graining $G$, the induced feature-class complexity under the $\ell_\infty$-type Rademacher measure does not increase, hence the estimation term in Theorem [3.2](https://arxiv.org/html/2607.00004#S3.Thmtheorem2) is no larger.

##### Takeaway.

This result suggests a trade-off: coarse-graining ($V \to V'$) prevents the inflation of the Rademacher complexity term inherent to large raw vocabularies, potentially tightening the bound if the approximation cost $\varepsilon_{\mathrm{RC}}$ (introduced by merging tokens) is kept minimal.

## 4. Vocabulary Transfer (VT): From Theory to Minimal Migration Cost

### 4.1. Design Goals

Advanced encoders such as ModernBERT often underperform in neural sparse retrieval due to vocabulary mismatch and excessive surface-form variability. Our theoretical analysis in Section [3](https://arxiv.org/html/2607.00004#S3) suggests that a representation-compatible, coarse-grained vocabulary yields better generalization guarantees. However, the theory describes the destination, not the path. Training such a model from scratch is costly. Therefore, our goal is to migrate a pretrained backbone from its source vocabulary $V$ to a more regularized, existing target vocabulary $V'$ (e.g., bert-base-uncased) at minimal migration cost. Our method adheres to two principles: (1) Distributional Consistency, ensuring initialized parameters preserve the statistical priors of the target domain; and (2) Optimization Efficiency, prioritizing the adaptation of new semantic units to pre-condition the model for downstream sparsity constraints.

### 4.2. Method: A Three-Step VT Recipe

Let the source model be $\mathcal{M} = (V, E, \mathbf{b})$, where $E \in \mathbb{R}^{|V| \times d}$ is the embedding matrix and $\mathbf{b} \in \mathbb{R}^{|V|}$ is the output bias vector representing unigram log-probabilities. The VT procedure produces $\mathcal{M}' = (V', E', \mathbf{b}')$ using the following steps.

#### 4.2.1. Step 1: Target Vocabulary Alignment

We align the model with a well-normalized, lowercased target vocabulary $V'$, taken from an existing language model. This choice allows us to leverage that model's pre-trained word embeddings for semantic initialization.

#### 4.2.2. Step 2: Embedding and Bias Initialization

Let $O = V \cap V'$ denote overlapping tokens and $N' = V' \setminus O$ new tokens. Simple topological initialization is insufficient as it ignores discrepancies in prior probabilities. We propose a joint initialization of spatial embeddings and biases:

Semantic Initialization via Spatial Topology. Our goal is to initialize $E' \in \mathbb{R}^{|V'| \times d}$ so that (i) tokens shared by both vocabularies preserve the pretrained model’s geometry, and (ii) newly introduced units land in semantically plausible regions of the *source* embedding manifold, avoiding random starts that would otherwise require long adaptation. This initialization aims to minimize the representation discrepancy (related to $\varepsilon_{\mathrm{RC}}$ in Theorem [3.2](https://arxiv.org/html/2607.00004#S3.Thmtheorem2)), ensuring that the starting point of the migration satisfies the compatibility assumptions made in our theoretical framework. For overlap tokens $u \in O$, we keep the pretrained parameters unchanged: $E'_u \leftarrow E_u$. For a new token $t \in N'$, we *transfer neighborhoods* from the target embedding space into the source space using the overlap set $O$ as anchors.

Let $\tilde{E} \in \mathbb{R}^{|V'| \times d}$ denote pretrained embeddings associated with the target vocabulary (e.g., bert-base-uncased). We first compute an affinity vector $\mathbf{s}_t \in \mathbb{R}^{|O|}$ between $t$ and each anchor $u \in O$ using cosine similarity:

$$s_{t,u} = \cos(\tilde{E}_t, \tilde{E}_u) = \frac{\tilde{E}_t^{\top} \tilde{E}_u}{\|\tilde{E}_t\|\,\|\tilde{E}_u\|}. \tag{7}$$

A dense interpolation over all anchors is undesirable: it blurs semantic neighborhoods and may introduce spurious mass on weakly related anchors, especially when $|O|$ is large. We therefore convert affinities into a *sparse* convex weighting $\boldsymbol{\alpha}_t$ by projecting onto the simplex with sparsemax (Martins and Astudillo, [2016](https://arxiv.org/html/2607.00004#bib.bib94)):

$$\boldsymbol{\alpha}_t = \mathrm{sparsemax}(\mathbf{s}_t) = \mathop{\mathrm{argmin}}_{\mathbf{p} \in \Delta^{|O|-1}} \|\mathbf{p} - \mathbf{s}_t\|^2, \tag{8}$$

which yields $\boldsymbol{\alpha}_t \geq 0$, $\sum_{u \in O} \alpha_{t,u} = 1$, and only a small subset of nonzero neighbors. Finally, we synthesize the source-space initialization by barycentric interpolation over the corresponding *source* anchors:

$$E'_t \leftarrow \sum_{u \in O} \alpha_{t,u}\, E_u. \tag{9}$$

This constructs $E'_t$ as the point whose local neighborhood (with respect to anchors) matches that of $t$ in the target space, effectively preserving *relative* semantic topology while staying on the pretrained source manifold. In practice, this produces meaningful embeddings before any MLM adaptation, substantially reducing the optimization burden for newly introduced tokens.
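
The three equations above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the released implementation: embeddings are plain lists, `anchors` pairs each overlap token's target-space vector with its source-space vector, and the toy vectors are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors, as in Eq. (7)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex, as in Eq. (8)."""
    z = sorted(scores, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, value in enumerate(z, start=1):
        cumulative += value
        if 1.0 + j * value > cumulative:   # j is still inside the support
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]

def init_new_embedding(target_vec, anchors):
    """Eq. (9): barycentric interpolation of source anchors with sparsemax weights.

    `anchors` is a list of (target_space_vector, source_space_vector) pairs for tokens in O.
    """
    affinities = [cosine(target_vec, tgt) for tgt, _ in anchors]
    alpha = sparsemax(affinities)
    dim = len(anchors[0][1])
    new_vec = [sum(a * src[i] for a, (_, src) in zip(alpha, anchors)) for i in range(dim)]
    return new_vec, alpha

# Toy anchors: 2-d target-space vectors paired with 3-d source-space vectors.
anchors = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0, 0.0]),
    ([-1.0, 0.0], [0.0, 0.0, 1.0]),
]
new_vec, alpha = init_new_embedding([0.9, 0.1], anchors)
```

In this toy case the anchor pointing away from the new token receives exactly zero weight, so the initialized vector is a convex combination of the two nearby source anchors only.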

Alternatively, for custom vocabularies lacking a corresponding pre-trained model, we employ Sub-token Initialization. We tokenize each new token $t$ using the source tokenizer into constituent sub-tokens and initialize $E'_t$ as the mean of their source embeddings. This constructs a semantic approximation from the source model's existing sub-word units.
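A minimal sketch of sub-token initialization follows. The lookup table standing in for the source tokenizer and the two-dimensional embeddings are hypothetical; only the averaging step reflects the description above.

```python
def subtoken_init(new_token, source_tokenize, source_embeddings):
    """Mean of the source embeddings of the pieces that `new_token` splits into."""
    pieces = source_tokenize(new_token)
    vectors = [source_embeddings[p] for p in pieces]
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

# Hypothetical stand-ins for the source tokenizer and its embedding table.
toy_splits = {"nationalists": ["Ġnational", "ists"]}
toy_embeddings = {"Ġnational": [1.0, 0.0], "ists": [0.0, 1.0]}

vec = subtoken_init("nationalists", toy_splits.__getitem__, toy_embeddings)
```

Because the result depends entirely on how the source tokenizer happens to split the word, a poor split yields a poor starting vector.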

Prior-Aware Distribution Alignment. The output bias $\mathbf{b}$ captures unigram priors. To map the target vocabulary's prior structure into the source model's dynamic range, we apply a *Z-score distribution transfer*:

$$\mathbf{b}' \leftarrow \mu(\mathbf{b}_{src}) + \sigma(\mathbf{b}_{src}) \cdot \frac{\mathbf{b}_{tgt} - \mu(\mathbf{b}_{tgt})}{\sigma(\mathbf{b}_{tgt})}, \tag{10}$$

where $\mu(\cdot)$ and $\sigma(\cdot)$ denote mean and standard deviation. This ensures common words in $V'$ receive higher initial biases while strictly adhering to the logit scale expected by the source Transformer.
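Eq. (10) amounts to standardizing the target biases and rescaling them to the source statistics. The sketch below uses toy bias values and the population standard deviation; whether the paper uses the population or the sample estimate is not stated, so that choice is an assumption.

```python
from statistics import fmean, pstdev

def zscore_transfer(b_tgt, b_src):
    """Standardize the target biases, then rescale to the source mean and std (Eq. 10)."""
    mu_src, sd_src = fmean(b_src), pstdev(b_src)
    mu_tgt, sd_tgt = fmean(b_tgt), pstdev(b_tgt)
    return [mu_src + sd_src * (b - mu_tgt) / sd_tgt for b in b_tgt]

b_src = [-2.0, 0.0, 2.0, 4.0]        # toy source-model output biases
b_tgt = [0.1, 0.2, 0.4, 0.3, 0.5]    # toy target-model output biases
b_new = zscore_transfer(b_tgt, b_src)
```

The transferred vector has one entry per target token, keeps the target ordering (frequent tokens stay above rare ones), and matches the mean and spread of the source biases.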

#### 4\.2\.3\.Step 3: Discrepancy\-Aware Adaptation

We freeze the Transformer layers and update only the embedding layer via a short Masked Language Modeling \(MLM\) phase, utilizing two efficiency mechanisms:

Overlap-Aware Masking Curriculum. Uniform masking is inefficient since tokens in $O$ are already well-learned. We use importance sampling for masking probabilities $P_{\text{mask}}(t) \propto \omega_t$, where $\omega_t = 1$ if $t \in O$ and $\omega_t = \lambda$ if $t \in N'$ ($\lambda > 1$). This curriculum focuses gradients on unaligned regions to accelerate convergence while preventing catastrophic forgetting.
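One way to realize $P_{\text{mask}}(t) \propto \omega_t$ is sketched below. The normalization (scaling the weights so that the expected fraction of masked positions equals a fixed overall rate) is an assumption; the text only specifies proportionality. The function names and the toy sentence are hypothetical.

```python
import random

def mask_probabilities(tokens, overlap, lam=2.0, rate=0.3):
    """Per-position masking probabilities proportional to w_t (1 for overlap, lam for new)."""
    weights = [1.0 if t in overlap else lam for t in tokens]
    scale = rate * len(tokens) / sum(weights)   # keeps the expected masked fraction at `rate`
    return [min(1.0, scale * w) for w in weights]

def sample_mask(tokens, overlap, lam=2.0, rate=0.3, rng=None):
    """Draw an independent Bernoulli mask decision for every position."""
    rng = rng or random.Random(0)
    return [rng.random() < p for p in mask_probabilities(tokens, overlap, lam, rate)]

tokens = ["the", "clears", "cat", "centimetres"]
overlap = {"the", "cat"}                         # toy overlap set O
probs = mask_probabilities(tokens, overlap)
mask = sample_mask(tokens, overlap)
```

With `lam=2.0`, new tokens are masked twice as often as overlap tokens while the average masking rate over the sequence stays at `rate`.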

Activation Potential Calibration. Sparse retrievers like SPLADE rely on ReLU to induce sparsity. However, improper initialization hinders adaptation: globally low logits lead to low activation rates and “dead neurons,” while globally high logits result in high activation rates and large magnitudes, producing excessive inner products that cause “dense collapse.” We shift the bias $\mathbf{b}'$ by a scalar $c$ after MLM ($\mathbf{b}' \leftarrow \mathbf{b}' - c$), where $c$ is determined by probing a subset of the training data. This calibration places the activation rate within a moderate range and ensures that non-zero activations follow a long-tail distribution from zero to the maximum value, creating an ideal regime for margin-based distillation.
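The text states only that $c$ is found by probing training data. A simple way to do this, assumed here purely for illustration, is to pick the threshold that leaves a target fraction of the probed logits positive after the shift. The function names and the toy logits are hypothetical.

```python
def activation_rate(logits, c=0.0):
    """Fraction of logits that stay positive (survive ReLU) after subtracting c."""
    flat = [x for row in logits for x in row]
    return sum(x - c > 0 for x in flat) / len(flat)

def calibrate_shift(probe_logits, target_rate=0.4):
    """Scalar c such that roughly `target_rate` of the probed logits exceed c."""
    flat = sorted(x for row in probe_logits for x in row)
    n_active = round(target_rate * len(flat))
    if not 0 < n_active < len(flat):
        raise ValueError("target_rate leaves no threshold inside the probed range")
    # Midpoint between the largest deactivated and the smallest active logit.
    return (flat[-n_active - 1] + flat[-n_active]) / 2.0

# Toy probe: every logit is positive, i.e. the "dense" regime before calibration.
probe = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
c = calibrate_shift(probe, target_rate=0.4)
```

Before the shift every toy logit is active; after subtracting `c`, four of the ten remain positive.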

Table 1. Main Results on BEIR. We report efficiency metrics (document length and FLOPs) and nDCG@10 performance for each dataset, followed by the average nDCG@10. † indicates models provided by us. The best performance results are bolded.

| Model | Doc_Len | FLOPs | Arg | Cli | DBP | FEV | FiQA | Hot | NFC | NQ | Quo | SCI | SciF | Tou | TREC | Avg nDCG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Co-SelfDistil (Formal et al., [2022](https://arxiv.org/html/2607.00004#bib.bib45)) | 197.0 | 11.1 | 49.3 | 26.2 | 44.1 | 81.6 | 36.2 | **69.3** | 35.4 | 54.2 | **85.0** | **16.0** | 71.5 | 24.8 | 72.4 | 51.2 |
| Co-EnsembleDistil (Formal et al., [2022](https://arxiv.org/html/2607.00004#bib.bib45)) | 159.5 | 7.8 | 50.8 | 24.4 | 43.6 | 80.0 | 35.5 | 68.7 | 35.3 | 53.9 | 83.4 | 15.8 | 70.8 | 27.3 | 72.5 | 50.9 |
| SPLADE-v3 (Lassance et al., [2024](https://arxiv.org/html/2607.00004#bib.bib46)) | 213.2 | 8.0 | 48.7 | 25.6 | **45.1** | 81.0 | **38.1** | 69.2 | **36.3** | **58.7** | 81.4 | 15.6 | **71.6** | 31.2 | 73.1 | 52.0 |
| BERT (Splade)† | 198.7 | 13.0 | **51.6** | 24.7 | 43.4 | 81.0 | 35.1 | 69.2 | 34.9 | 53.8 | 80.7 | 15.7 | 71.3 | 25.8 | 70.8 | 50.6 |
| ModernBERT-VT† | 203.1 | 9.1 | 49.0 | **27.8** | **45.1** | **84.4** | 36.3 | 68.8 | 35.8 | 55.9 | 83.6 | 15.6 | 70.5 | **33.0** | **75.9** | **52.4** |

## 5\.Experimental Setup

### 5\.1\.Datasets and Evaluation Metrics

##### Training Data

For the MLM adaptation phase, we use the combined English Wikipedia and BookCorpus (Zhu et al., [2015](https://arxiv.org/html/2607.00004#bib.bib90)) datasets, comprising approximately 6.2 million documents and 3.7 billion tokens. For the sparse retrieval fine-tuning, we utilize the MS MARCO Passage Retrieval dataset (Nguyen et al., [2016](https://arxiv.org/html/2607.00004#bib.bib86)). Specifically, we use the same training data as SPLADE-EnsembleDistil, i.e., the msmarco-hard-negatives dataset (https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives), which includes hard negatives mined from MS MARCO and supervisory scores generated by a cross-encoder teacher model (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2).

##### Evaluation Benchmarks

For in-domain evaluation, we report MRR@10 and Recall@1000 on the MS MARCO official development set (Dev). We additionally evaluate on the TREC-DL 2019 (Craswell et al., [2025](https://arxiv.org/html/2607.00004#bib.bib87)) query sets, reporting nDCG@10 and Recall@1000. To evaluate zero-shot generalization, we test on the BEIR benchmark (Thakur et al., [2021](https://arxiv.org/html/2607.00004#bib.bib34)). Following previous work (Formal et al., [2021b](https://arxiv.org/html/2607.00004#bib.bib8), [a](https://arxiv.org/html/2607.00004#bib.bib11); Lassance and Clinchant, [2022](https://arxiv.org/html/2607.00004#bib.bib47); Lassance et al., [2024](https://arxiv.org/html/2607.00004#bib.bib46); Geng et al., [2024](https://arxiv.org/html/2607.00004#bib.bib7)), we use a subset of 13 datasets: TREC-COVID, NFCorpus, NQ, HotpotQA, FiQA-2018, ArguAna, Webis-Touché2020, DBPedia-Entity, SCIDOCS, FEVER, Climate-FEVER, SciFact, and Quora. We report nDCG@10.
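For readers less familiar with the two headline metrics, the following sketch computes MRR@10 and nDCG@10 for a single query; reported scores average these per-query values over all queries. It uses one common definition of nDCG (linear gain with a $\log_2$ rank discount) and toy document identifiers. It is not the BEIR toolkit itself.

```python
import math

def mrr_at_k(ranked_ids, relevant, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering of the judged docs."""
    dcg = sum(gains.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(ranked_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

ranking = ["d3", "d1", "d2"]                  # toy system output, best first
mrr = mrr_at_k(ranking, relevant={"d1"})      # first relevant hit at rank 2
ndcg = ndcg_at_k(ranking, gains={"d1": 1})
```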

### 5\.2\.Models and Baselines

Our analysis centers on the performance disparity between established encoders and modern architectures in sparse retrieval\.

##### Backbones

We utilize answerdotai/ModernBERT-base (Warner et al., [2025](https://arxiv.org/html/2607.00004#bib.bib30)) as our primary modern backbone. To demonstrate our VT method, we migrate this model to the vocabulary of bert-base-uncased. In this transfer, overlapping tokens account for 56.4% of the vocabulary, while new tokens account for 43.6%. We compare these against standard BERT-based sparse implementations.
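The overlap and new-token shares partition the target vocabulary. A toy sketch of that split is given below; it assumes plain string equality between source and target tokens, since this section does not describe how surface forms (for example, BPE space markers) are matched. The vocabularies are hypothetical.

```python
def split_vocab(source_vocab, target_vocab):
    """Partition the target vocabulary into overlap tokens (O) and new tokens (N')."""
    source = set(source_vocab)
    overlap = [t for t in target_vocab if t in source]
    new = [t for t in target_vocab if t not in source]
    return overlap, new

# Hypothetical toy vocabularies.
source_vocab = ["The", "the", "Ġcat", "cat"]
target_vocab = ["the", "cat", "clears", "##s"]
overlap, new = split_vocab(source_vocab, target_vocab)
overlap_share = len(overlap) / len(target_vocab)
```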

##### Baselines

We compare our models against a comprehensive set of baselines categorized into two groups. Reference Baselines: These include the traditional lexical model BM25 and established dense and sparse neural retrievers such as DPR, CoCondenser, ColBERTv2, uniCOIL, DeepImpact (Mallia et al., [2021](https://arxiv.org/html/2607.00004#bib.bib27)), and DeeperImpact. Baseline results are sourced directly from their respective original publications. SPLADE Family & Derivatives: We compare against the standard SPLADE models, specifically CoCondenser-SelfDistil and CoCondenser-EnsembleDistil (Formal et al., [2022](https://arxiv.org/html/2607.00004#bib.bib45)), as well as the recent SPLADE-v3 (Lassance et al., [2024](https://arxiv.org/html/2607.00004#bib.bib46)) and ESPLADE (Dudek et al., [2023](https://arxiv.org/html/2607.00004#bib.bib84)). To ensure a fair comparison and eliminate discrepancies arising from different evaluation pipelines, we re-evaluated all available open-source SPLADE checkpoints using our own evaluation pipeline.

### 5\.3\.Implementation Details

##### Training Protocol

For the VT Adaptation (MLM) phase, we train for 20k steps (approx. 1 epoch) on Wikipedia and BookCorpus. We use the AdamW optimizer with a learning rate of 3e-4, a cosine scheduler with 4,000 warmup steps, and a global batch size of 2,048. Input sequences are truncated to 128 tokens, and the MLM masking probability is set to 0.3. In the Overlap-Aware Masking Curriculum, the importance sampling weight for new tokens is set to $\lambda = 2$. For the Activation Potential Calibration (APC), we set the scalar shift $c = 5$, which results in an initial activation rate of approximately 40%.

For Sparse Retrieval Fine-tuning, our training pipeline and configuration align strictly with the established protocol of SPLADE-EnsembleDistil (Formal et al., [2022](https://arxiv.org/html/2607.00004#bib.bib45)) to isolate the impact of the backbone and vocabulary. We employ a knowledge distillation approach. The model is fine-tuned using the MarginMSE loss combined with FLOPs regularization (Paria et al., [2020](https://arxiv.org/html/2607.00004#bib.bib72)) to control sparsity. All models are trained with a maximum sequence length of 256 tokens.
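The combined objective can be sketched as follows for sparse vectors that have already been computed. MarginMSE penalizes the squared gap between the student's score margin and the teacher's; the FLOPs regularizer sums, over vocabulary entries, the squared mean absolute activation in a batch. The regularization weights `lambda_q` and `lambda_d` and the toy vectors are hypothetical, not the paper's settings.

```python
def dot(a, b):
    """Inner product used as the sparse retrieval score."""
    return sum(x * y for x, y in zip(a, b))

def margin_mse(q, d_pos, d_neg, teacher_pos, teacher_neg):
    """Squared gap between the student margin and the teacher margin for one triple."""
    student_margin = dot(q, d_pos) - dot(q, d_neg)
    return (student_margin - (teacher_pos - teacher_neg)) ** 2

def flops_reg(batch):
    """Sum over vocabulary entries of (mean absolute weight across the batch) squared."""
    n, vocab = len(batch), len(batch[0])
    return sum((sum(abs(row[j]) for row in batch) / n) ** 2 for j in range(vocab))

def total_loss(queries, pos_docs, neg_docs, t_pos, t_neg, lambda_q=0.1, lambda_d=0.1):
    """Mean MarginMSE over triples plus FLOPs penalties on queries and documents."""
    triples = zip(queries, pos_docs, neg_docs, t_pos, t_neg)
    distill = sum(margin_mse(*t) for t in triples) / len(queries)
    return distill + lambda_q * flops_reg(queries) + lambda_d * flops_reg(pos_docs + neg_docs)

# One toy triple over a two-term vocabulary.
loss = total_loss([[1.0, 0.0]], [[2.0, 0.0]], [[0.0, 3.0]], t_pos=[3.0], t_neg=[1.0])
```

In the toy triple the student margin already equals the teacher margin, so the remaining loss comes entirely from the two sparsity penalties.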

##### Evaluation Configuration

During evaluation, document inputs are truncated to a maximum length of 512 tokens. We use OpenSearch (https://opensearch.org/) as our lexical search engine to construct the inverted index and perform the retrieval process. Metrics are computed using the official BEIR evaluation toolkit.

##### Reproducibility

To ensure reproducibility, we fix the random seed to 42 for all experiments. We execute training for a fixed 150k steps and select the final checkpoint for evaluation. The implementation is based on PyTorch (Paszke et al., [2019](https://arxiv.org/html/2607.00004#bib.bib88)) and the HuggingFace transformers library (Wolf et al., [2019](https://arxiv.org/html/2607.00004#bib.bib89)). All experiments were conducted on 8 NVIDIA A100 Tensor Core GPUs (80GB VRAM).

## 6\.Results and Analysis

Table 2. Main Results on BEIR (OOD) and MS MARCO (In-Domain). For SPLADE-based models, we report knowledge distillation (KD) teacher numbers. We report average nDCG@10 on BEIR. † indicates models provided by us. For our primary model (ModernBERT-VT), the ± values denote the standard deviation over 5 random seeds; other entries use a single seed (42). The best results in each group are bolded, and the best overall results are underlined.

| Model | Teach. | BEIR nDCG@10 | MSM MRR@10 | MSM R@1k | DL-19 nDCG@10 | DL-19 R@1k |
|---|---|---|---|---|---|---|
| *Reference Baselines* | | | | | | |
| BM25 | - | 43.7 | 18.4 | 85.3 | 50.6 | 74.5 |
| DPR (Karpukhin et al., [2020b](https://arxiv.org/html/2607.00004#bib.bib36)) | - | 37.5 | 31.9 | 94.1 | 61.1 | 74.2 |
| CoCondenser (Gao and Callan, [2021](https://arxiv.org/html/2607.00004#bib.bib55)) | - | 42.0 | 38.2 | 98.4 | 67.4 | 82.0 |
| ColBERTv2 (Santhanam et al., [2022](https://arxiv.org/html/2607.00004#bib.bib70)) | - | **50.0** | **39.7** | **98.5** | <u>**74.4**</u> | <u>**88.2**</u> |
| uniCOIL (Lin and Ma, [2021](https://arxiv.org/html/2607.00004#bib.bib91); Thakur et al., [2023](https://arxiv.org/html/2607.00004#bib.bib92)) | - | 44.1 | 35.1 | - | 69.3 | - |
| DeepImpact (Mallia et al., [2021](https://arxiv.org/html/2607.00004#bib.bib27); Thakur et al., [2023](https://arxiv.org/html/2607.00004#bib.bib92)) | - | 41.5 | 32.7 | 94.8 | 69.5 | - |
| DeeperImpact (Basnet et al., [2024](https://arxiv.org/html/2607.00004#bib.bib93)) | - | - | 37.3 | 96.8 | - | - |
| BERT (dense)† | - | 42.1 | 32.9 | 93.3 | 65.7 | 64.7 |
| ModernBERT (dense)† | - | 44.2 | 32.9 | 94.8 | 63.7 | 65.7 |
| *SPLADE Family & Derivatives* | | | | | | |
| Co-SelfDistil (Formal et al., [2022](https://arxiv.org/html/2607.00004#bib.bib45)) | 1 | 51.2 | 37.5 | 98.5 | **73.5** | **83.4** |
| Co-EnsembleDistil (Formal et al., [2022](https://arxiv.org/html/2607.00004#bib.bib45)) | 1 | 50.9 | 38.3 | 98.3 | 73.1 | 83.0 |
| ESPLADE (Dudek et al., [2023](https://arxiv.org/html/2607.00004#bib.bib84)) | 1 | 51.2 | 38.1 | 98.3 | - | - |
| SPLADE-v3 (Lassance et al., [2024](https://arxiv.org/html/2607.00004#bib.bib46)) | 5 | **52.0** | <u>**40.0**</u> | <u>**98.7**</u> | 72.7 | 83.2 |
| *Our Results†* | | | | | | |
| BERT (Splade) | 1 | 50.6 | 37.5 | 98.2 | 72.8 | 82.1 |
| + further MLM | 1 | 50.5 | 37.7 | 98.1 | 72.6 | 82.2 |
| ModernBERT (Splade) | 1 | 47.7 | 35.7 | 98.0 | 66.3 | 79.9 |
| + further MLM | 1 | 47.5 | 35.4 | 97.8 | 69.0 | 79.4 |
| + lowercase input | 1 | 46.8 | 36.3 | 97.9 | 69.2 | 80.5 |
| ModernBERT-VT | 1 | <u>**52.4**</u> ±0.04 | **38.3** ±0.08 | **98.4** ±0.02 | **73.7** ±0.7 | **83.0** ±0.3 |

### 6\.1\.RQ1: Effectiveness of Vocabulary Transfer

We analyze VT’s effectiveness in enabling advanced encoders for sparse retrieval, comparing against baselines from Section[5](https://arxiv.org/html/2607.00004#S5)\. We include SPLADE\-v3 as a reference point using publicly released checkpoints; however, SPLADE\-v3 employs additional training structure changes and ensemble teacher scores, making it not strictly controlled under our training setting\. Therefore, our main comparisons focus on SPLADE\-ensemble\-distill, which matches our supervision and pipeline\.

#### 6\.1\.1\.The Vocabulary Gap Anomaly

As shown in Table [2](https://arxiv.org/html/2607.00004#S6.T2), the dense retrieval results confirm the superiority of modern architectures: ModernBERT (dense) achieves an average nDCG@10 of 44.2 on BEIR, outperforming BERT (dense) at 42.1. However, this advantage vanishes when applying standard sparse fine-tuning. Naive ModernBERT (Splade) lags behind the older Co-EnsembleDistil (47.7 vs 50.9 on BEIR; 35.7 vs 38.3 MRR@10 on MS MARCO), confirming that advanced encoders with raw vocabularies are ill-suited for sparse retrieval without adaptation.

#### 6\.1\.2\.Effectiveness of Vocabulary Transfer

Applying VT effectively bridges this gap. Our ModernBERT-VT model achieves a BEIR score of 52.4, a +4.7 improvement over the naive implementation (47.7). In our internal comparison of “controlled” implementations (rows marked with †), we observe distinct trends.

Naive vs. Lowercase. Simply forcing the input to be lowercased (+ lowercase input) provides mixed results. While it offers slight relief for in-domain matching (improving MRR@10 from 35.7 to 36.3), it degrades zero-shot performance on BEIR (dropping from 47.7 to 46.8). This suggests that while preprocessing can help align some surface forms, it fails to fully leverage the semantic capacity of the backbone and breaks the token integrity required for generalization.

Ineffectiveness of Further MLM. Simply continuing MLM on the target corpus (+ further MLM) fails to improve performance (BEIR drops from 47.7 to 47.5), confirming the gains stem from VT rather than extra computation.

The VT Advantage. ModernBERT-VT outperforms both the naive and lowercased variants by a wide margin, validating that proper vocabulary adaptation is essential to release the potential of ModernBERT.

#### 6\.1\.3\.Comparison with State\-of\-the\-Art

ModernBERT-VT establishes a new state-of-the-art on BEIR compared to other sparse retrievers. On BEIR, ModernBERT-VT (52.4) outperforms Co-EnsembleDistil (50.9) and even SPLADE-v3 (52.0), despite the latter utilizing a complex 5-teacher ensemble training pipeline. Table [1](https://arxiv.org/html/2607.00004#S4.T1) confirms these gains maintain efficiency comparable to SPLADE baselines, validating that VT effectively unlocks ModernBERT's robust generalization capabilities. On MS MARCO, ModernBERT-VT (38.3 MRR@10) surpasses all baselines except SPLADE-v3 (40.0). Crucially, SPLADE-v3 benefits from orthogonal enhancements (multi-stage ensemble distillation), whereas we use a standard single-teacher setup to strictly isolate the backbone's impact. To verify statistical reliability, we additionally train ModernBERT-VT with five random seeds: the resulting small standard deviations (BEIR: $\pm 0.04$; MRR@10: $\pm 0.08$) confirm that our reported gains over baselines are not due to seed variance.

#### 6\.1\.4\.Summary

The results show that the “lag” of advanced encoders is a vocabulary-alignment issue rather than an architectural one. VT unlocks ModernBERT's capacity for sparse retrieval, retaining its superior generalization and achieving competitive in-domain performance, effectively closing the dense-sparse gap.

### 6\.2\.RQ2: Impact of Initialization Strategies and VT Components

To verify our VT recipe, we conduct ablation studies on BEIR \(OOD\) and MS MARCO \(In\-Domain\), examining embedding initialization strategies and adaptation objectives \(PDA, OMC, APC\)\. Results are summarized in Table[3](https://arxiv.org/html/2607.00004#S6.T3)\.

Table 3. Ablation on BEIR (OOD) and MS MARCO (In-Domain). We study initialization strategies and VT components. “Direct” denotes fine-tuning without MLM; “Adapted” applies 20k steps (approx. 1 epoch) of MLM before fine-tuning.

#### 6\.2\.1\.Impact of Initialization Strategies

We compare Semantic-Init against several baselines: Rand-All (randomizes all embeddings); Rand-New (randomizes only non-overlapping tokens); Mean-Init (sets new tokens to the overlapping vocabulary centroid); and SubToken-Init (averages constituent sub-tokens). We report results for two settings: “Direct” and “Adapted”.

Semantic-Init is superior to all other methods. Table [3](https://arxiv.org/html/2607.00004#S6.T3) shows that Semantic-Init consistently outperforms other strategies. In the Direct setting, it provides a strong “warm-start” (51.1 on BEIR). MLM adaptation further boosts this to optimal performance (52.4). Surprisingly, just 500 MLM steps yield near-optimal results (52.2), validating that our method offers a high-quality starting point requiring minimal gradient updates to align the vocabulary.

SubToken-Init is robust but suffers from semantic collapse in Direct FT. SubToken-Init is a competitive baseline when target embeddings are missing, nearly matching Semantic-Init after adaptation (52.3 vs 52.4 on BEIR). However, in Direct fine-tuning, its high In-Domain accuracy (37.7) contrasts with a significant OOD drop (49.3). This discrepancy implies “semantic collapse,” where embeddings overfit the training domain while drifting from intrinsic semantics. Table [4](https://arxiv.org/html/2607.00004#S6.T4) illustrates why SubToken-Init can be suboptimal. While effective for clean splits (e.g., “nationalists”), SubToken-Init fails on ambiguous fragments like “clears” ($\rightarrow$ “cle”, “ars”) or “centimetres” ($\rightarrow$ “cent”, “imet”, “res”). In contrast, Semantic-Init identifies robust neighbors independent of surface forms.

Normalization compensates for random initialization. Surprisingly, even Rand-New surpasses the ModernBERT (Splade) baseline in both settings. This suggests that the gains from normalizing the sparse output space outweigh the degradation caused by randomly initializing a portion of the vocabulary.

Table 4. Case study of sub-token decompositions and top-5 semantic neighbors and weights for new tokens.

| # | Word | Sub-tokens | Top-5 neighbors (weight) |
|---|---|---|---|
| 1 | collaborations | ["Ġcollabor", "ations"] | (Ġcollaborate, 0.123), (Ġcollaboration, 0.118), (Ġcollaborators, 0.107), (Ġcollaborated, 0.098), (Ġcollaborative, 0.090) |
| 2 | nationalists | ["Ġnational", "ists"] | (nationalist, 0.192), (nationalism, 0.126), (liberals, 0.056), (conservatives, 0.044), (militants, 0.039) |
| 3 | clears | ["Ġcle", "ars"] | (cleared, 0.165), (clearing, 0.064), (clearer, 0.049), (removes, 0.043), (facilitates, 0.030) |
| 4 | centimetres | ["Ġcent", "imet", "res"] | (centimeters, 0.178), (cm, 0.169), (kilograms, 0.068), (inches, 0.056), (kilometres, 0.051) |

#### 6\.2\.2\.Impact of Adaptation Components

We dissect the VT adaptation phase by removing specific components from the optimalSemantic\-Initconfiguration\.

Prior-Aware Distribution Alignment (PDA) and Overlap-Aware Masking Curriculum (OMC). Both components are vital for performance. Removing PDA degrades in-domain (MS MARCO MRR@10: $38.3 \rightarrow 38.1$) and OOD metrics, while removing OMC leads to sub-optimal convergence. Together, they ensure the model learns new vocabulary effectively without overfitting or forgetting pre-trained knowledge of overlapping tokens.

The Critical Role of Activation Potential Calibration (APC). Removing APC causes the most significant impact. We find that MLM adaptation sharpens output logits, making them ill-positioned for SPLADE's ReLU activation. This causes instability: without APC, the “Adapted” model performs worse on BEIR (48.8) than the “Direct” baseline (50.7). APC realigns the activation potential, allowing sparsity regularization to operate effectively and translating semantic gains into retrieval performance.

Table 5. Sensitivity analysis of APC target activation rate on ModernBERT-VT. We report BEIR (nDCG@10), FLOPs, and MS MARCO (MRR@10, R@1k). Our default $c = 5$ falls between the 30% and 40% activation rate settings.

Sensitivity to APC Target Activation Rate. Table [5](https://arxiv.org/html/2607.00004#S6.T5) reports results across target activation rates from 10% to 90%. BEIR performance forms a plateau in the 30–50% activation rate range (52.2–52.4), indicating that the method is robust to the exact choice of $c$ and does not require per-backbone tuning. Above 60% activation rate, excessive sparsity suppression starves the sparse representation of active dimensions, degrading OOD generalization sharply. Below 20%, the increased activation density raises FLOPs without improving retrieval quality. In-domain MS MARCO metrics remain stable throughout (MRR@10: 37.2–38.5), confirming that APC primarily governs the effectiveness–efficiency trade-off for out-of-domain transfer. Our default $c = 5$ falls between the 30% and 40% activation rate settings, sitting squarely within this plateau and balancing strong BEIR performance with moderate computational cost.

### 6\.3\.RQ3: Full\-Vocabulary Transfer vs\. Head\-Only Adaptation

We compare full-vocabulary transfer (VT) against head-only adaptation, a natural alternative where the backbone tokenizer is kept intact while only the output decoder head is replaced to match the target vocabulary. This design, exemplified by ESPLADE (Dudek et al., [2023](https://arxiv.org/html/2607.00004#bib.bib84)), aims to decouple the encoder's input space from its output feature space.

Experimental Setup. We implement ModernBERT-ESPLADE by replacing only the importance head with our semantic initialization. We further evaluate two adaptation strategies: (1) +EMLM, which applies the unsupervised task from Dudek et al. ([2023](https://arxiv.org/html/2607.00004#bib.bib84)); and (2) +SAP, which distills sparse lexical representations from a strong teacher (SPLADE-v3) following MILCO (Nguyen et al., [2025](https://arxiv.org/html/2607.00004#bib.bib96)).

VT Outperforms Head-only Adaptation. As shown in Table [6](https://arxiv.org/html/2607.00004#S6.T6), full-vocabulary transfer consistently dominates head-only designs. Even without adaptation, VT surpasses ModernBERT-ESPLADE by +1.7 BEIR nDCG@10, suggesting that aligning the input interface with the target domain is crucial. While SAP improves performance, it remains inferior to VT across all benchmarks (e.g., 50.5 vs. 52.4 on BEIR) and requires an additional dependency on a teacher model.

Table 6. Full-vocabulary transfer vs. ESPLADE-style head-only adaptation.

The Alignment Bottleneck. Notably, EMLM degrades performance in our setting (49.4 $\rightarrow$ 48.0). We attribute this to the many-to-one mapping conflict: when the target vocabulary is more granular than the source (target tokens $N$ > source tokens $M$), ESPLADE's “first-overlap” supervision forces $N$ labels onto a single source position. This ill-defined mapping creates ambiguous training signals that confuse the model. Importantly, this issue is avoided in the *original* EMLM setting, where the target vocabulary is a *word-level unigram* lexicon instead of sub-tokens. In contrast, VT unifies the input and output tokenization for MLM training.

Conclusion. Modifying only the language-model head is sub-optimal. Because the input embedding space remains unchanged, the encoder still organizes representations around the *source* subword units, and the new head must learn a difficult post-hoc translation into the target sparse vocabulary from sparse retrieval supervision alone. VT is essential to fully synchronize the model's internal reasoning with the sparse retrieval objective.

### 6\.4\.RQ4: Generalization Across Different Backbones

To assess the universality of our approach, we extend VT to several widely used encoder architectures, including RoBERTa-base, RoBERTa-large, and BERT-base-cased. Note that RoBERTa models employ a case-sensitive BPE tokenizer with a vocabulary size of 50,265, while BERT-base-cased uses a case-sensitive WordPiece tokenizer with 28,996 tokens. In this section, we use VT to migrate each backbone's native vocabulary to the vocabulary of bert-base-uncased, while keeping all other settings identical to those used in Table [2](https://arxiv.org/html/2607.00004#S6.T2), including the training protocol and distillation pipeline. Specifically, for RoBERTa, overlapping tokens account for 59.6% and additional tokens for 40.4%; for BERT-base-cased, overlapping tokens constitute 60.9% and new tokens 39.1%. The results in Table [7](https://arxiv.org/html/2607.00004#S6.T7) reveal several key insights:

Table 7. Generalization of VT Across Other Backbones. We report average nDCG@10 on BEIR, and MRR/nDCG metrics for MS MARCO and DL-2019. The best results for each backbone are bolded.

The vocabulary gap is universal. The performance degradation is not unique to ModernBERT: standard SPLADE fine-tuning on RoBERTa-base (48.0 nDCG@10 on BEIR) lags behind the BERT-base baseline (50.6). More strikingly, RoBERTa-large and BERT-base-cased suffer from dense collapse due to universally high activation values, yielding near-zero performance (e.g., 1.4 and 10.6 on BEIR). This indicates that raw, case-sensitive vocabularies are fundamentally incompatible with sparse retrieval objectives across model families and scales.

VT consistently restores and improves performance. Applying VT closes these gaps across all backbones. In particular, VT elevates RoBERTa-large from a non-functional regime (1.4) to strong performance (51.3 on BEIR), surpassing the BERT-base baseline and matching the gains observed with ModernBERT.

VT is robust to model scale and tokenizer type. Its strong performance on both RoBERTa-large (355M parameters) and BERT-base-cased (WordPiece tokenizer) highlights VT's ability to adapt to larger parameter spaces and diverse tokenization schemes, all without the need for any additional continuous pre-training.

Overall, these results suggest that VT is a *model-agnostic* solution, providing a robust pathway for integrating diverse encoder architectures into the sparse retrieval paradigm by directly resolving the underlying lexical mismatch.

### 6\.5\.RQ5: Generalization to Inference\-Free Retrieval

To verify the robustness of VT across different sparse retrieval architectures, we evaluate its performance in an inference-free setting (Geng et al., [2024](https://arxiv.org/html/2607.00004#bib.bib7); Shen et al., [2025](https://arxiv.org/html/2607.00004#bib.bib12)). Following the experimental protocol of Shen et al. ([2025](https://arxiv.org/html/2607.00004#bib.bib12)), we maintain an identical training pipeline, dataset, and hyperparameter configuration, replacing only the backbone encoder with ModernBERT to isolate the impact of VT. The results in Table [8](https://arxiv.org/html/2607.00004#S6.T8) reveal several key insights:

Table 8. Performance of VT-adapted models on Inference-free LSR tasks. Following Shen et al. ([2025](https://arxiv.org/html/2607.00004#bib.bib12)), we employ IDF enhancement (Geng et al., [2024](https://arxiv.org/html/2607.00004#bib.bib7)) and $\ell_0$ masked flops (Shen et al., [2025](https://arxiv.org/html/2607.00004#bib.bib12)).

Persistence of the Vocabulary Gap. Similar to the standard SPLADE setting, the naive ModernBERT underperforms significantly in the inference-free regime, with its 44.4 nDCG@10 failing to even match the BM25 baseline. Notably, the naive model yields longer document lengths yet lower FLOPs than baselines, suggesting that the lexical mismatch of surface variants is more pronounced in the inference-free architecture.

Effectiveness of VT. ModernBERT-VT effectively bridges this gap, achieving a state-of-the-art nDCG@10 of 51.5. This represents a substantial improvement over both the SPLADE-v3 baseline and the $\ell_0$-enhanced models (Shen et al., [2025](https://arxiv.org/html/2607.00004#bib.bib12)).

Efficiency Synergy. When combined with $\ell_0$-flops and $\ell_0$-activation, ModernBERT-VT achieves the best balance between effectiveness and efficiency, yielding the shortest average document length (265.3) and competitive FLOPs (2.1).

These findings demonstrate that the benefits of VT are not limited to dual\-encoder sparse retrievers but extend to high\-efficiency, inference\-free indices, allowing modern backbones to realize their full potential in latency\-critical applications\.

### 6\.6\.RQ6: Domain Specialization via Vocabulary Transfer

Learned Sparse Retrieval (LSR) is particularly sensitive to tokenization in specialized domains (e.g., Chemistry), where general-purpose tokenizers often over-fragment technical terms. We investigate whether VT can effectively migrate a general backbone to a domain-specific vocabulary synthesized from scratch.

Experimental Setup. We train five BPE tokenizers (sizes 10k–50k) on the dolma-chem corpus (BASF-AI, [2025](https://arxiv.org/html/2607.00004#bib.bib98)), utilizing BERT-style normalization (lowercase, stripping accents). We migrate ModernBERT-base to these vocabularies using VT with sub-token initialization, followed by a brief MLM adaptation (3k steps). Models are fine-tuned on 200k chemistry query-document pairs using InfoNCE loss. We evaluate on ChemHotpotQA and ChemNQ (Kasmaee et al., [2024](https://arxiv.org/html/2607.00004#bib.bib97)). Due to the limited size of the training data, we observed variance in model performance. To ensure the reliability of our results, we report the mean and standard deviation across five random seeds.

Table 9. Effectiveness of Domain-Specific Vocabulary Transfer. Results are nDCG@10 (Mean $\pm$ Std) over 5 runs. “Frag.” denotes the fragmentation rate (Tokens/Word).

Results and analysis. VT markedly improves cross-domain transfer to chemistry, yielding large gains over the original general-domain vocabulary on both datasets. Across vocab sizes, fragmentation decreases monotonically as expected, but retrieval performance does not: ChemHotpotQA peaks at 40k, whereas ChemNQ peaks at 10k. This decoupling indicates that reduced token fragmentation is beneficial but insufficient to predict adaptation quality. With limited domain MLM (3k steps) and only 200k supervised pairs, larger vocabularies introduce more rare sub-tokens whose embeddings and lexical weights are weakly trained, which can increase variance and hurt generalization (notably on ChemNQ). Overall, these results support VT as an effective mechanism for *cross-domain* adaptation of sparse retrievers, while highlighting that vocab-size selection remains a non-trivial trade-off between tokenization adequacy and parameter/data efficiency.
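The fragmentation rate (tokens per word) reported in Table 9 can be computed as sketched below. Treating whitespace-delimited strings as words is an assumption, and `chunk4` is a hypothetical stand-in for a trained BPE tokenizer.

```python
def fragmentation_rate(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    words = [w for text in texts for w in text.split()]
    return sum(len(tokenize(w)) for w in words) / len(words)

def chunk4(word):
    """Hypothetical tokenizer: cuts a word into pieces of at most four characters."""
    return [word[i:i + 4] for i in range(0, len(word), 4)]

rate = fragmentation_rate(["benzene ring", "ethanol"], chunk4)
```

A rate of 1.0 means every word maps to a single token; higher values indicate heavier splitting of domain terms.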

## 7\.Conclusion

In this work, we demonstrate that the performance degradation of advanced encoders in sparse retrieval is not an architectural deficiency but a consequence of the vocabulary gap\. By shifting from lossless\-reconstruction\-oriented modern tokenization to normalized, representation\-compatible vocabularies, we unlock the reasoning power of next\-generation backbones for lexical matching\. Our proposed VT method provides a robust and efficient solution, allowing models like ModernBERT and RoBERTa to adapt to sparse\-friendly vocabularies via geometric initialization and minimal adaptation steps\. VT not only restores the competitiveness of these models but establishes new state\-of\-the\-art results across out\-of\-domain benchmarks and inference\-free architectures\. Ultimately, our findings highlight that vocabulary design is a fundamental bottleneck in neural sparse retrieval and offer a universal recipe to bridge the divide between modern foundation models and sparse retrieval objectives\.

## References

- M. Artetxe, S. Ruder, and D. Yogatama (2020) On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4623–4637. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p1.1).
- Y. Bai, X. Li, G. Wang, C. Zhang, L. Shang, J. Xu, Z. Wang, F. Wang, and Q. Liu (2020) SparTerm: learning term-based sparse representation for fast text retrieval. arXiv preprint arXiv:2010.00768. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p2.1).
- P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3, pp. 463–482. Cited by: [§3.2](https://arxiv.org/html/2607.00004#S3.SS2.SSS0.Px1.p1.7).
- BASF-AI (2025) dolma-chem-only-query-generated. Note: Hugging Face Datasets. Chemistry-focused query-generated subset; accessed 2025-12-26. External Links: [Link](https://huggingface.co/datasets/BASF-AI/dolma-chem-only-query-generated). Cited by: [§6.6](https://arxiv.org/html/2607.00004#S6.SS6.p2.1).
- S. Basnet, J. Gou, A. Mallia, and T. Suel (2024) DeeperImpact: optimizing sparse learned index structures. arXiv preprint arXiv:2405.17093. Cited by: [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.18.10.1).
- P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p1.1).
- N. Craswell, B. Mitra, E. Yilmaz, D. Campos, J. Lin, E. M. Voorhees, and I. Soboroff (2025) Overview of the TREC 2022 deep learning track. arXiv preprint arXiv:2507.10865. Cited by: [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1).
- Z. Dai and J. Callan (2020) Context-aware term weighting for first stage passage retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1533–1536. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p2.1).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1).
- K. Dobler and G. de Melo (2023) FOCUS: effective embedding initialization for monolingual specialization of multilingual models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 13440–13454. External Links: [Link](https://aclanthology.org/2023.emnlp-main.829/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.829). Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p1.1).
- J. M. Dudek, W. Kong, C. Li, M. Zhang, and M. Bendersky (2023) Learning sparse lexical representations over specified vocabularies for retrieval. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 3865–3869. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p2.1), [§5.2](https://arxiv.org/html/2607.00004#S5.SS2.SSS0.Px2.p1.1), [§6.3](https://arxiv.org/html/2607.00004#S6.SS3.p1.1), [§6.3](https://arxiv.org/html/2607.00004#S6.SS3.p2.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.22.14.1).
- H. El Boukkouri, O. Ferret, T. Lavergne, H. Noji, P. Zweigenbaum, and J. Tsujii (2020) CharacterBERT: reconciling ELMo and BERT for word-level open-vocabulary representations from characters. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 6903–6915. Cited by: [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1).
- T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant (2021a) SPLADE v2: sparse lexical and expansion model for information retrieval. arXiv. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2109.10086), [Link](https://arxiv.org/abs/2109.10086). Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1), [§3.3](https://arxiv.org/html/2607.00004#S3.SS3.p1.4), [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1).
- T. Formal, C. Lassance, B. Piwowarski, and S. Clinchant (2022) From distillation to hard negative sampling: making sparse neural IR models more effective. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2353–2359. Cited by: [Table 1](https://arxiv.org/html/2607.00004#S4.T1.4.2.5.3.1), [Table 1](https://arxiv.org/html/2607.00004#S4.T1.4.2.6.4.1), [§5.2](https://arxiv.org/html/2607.00004#S5.SS2.SSS0.Px2.p1.1), [§5.3](https://arxiv.org/html/2607.00004#S5.SS3.SSS0.Px1.p2.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.20.12.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.21.13.1).
- T. Formal, B. Piwowarski, and S. Clinchant (2021b) SPLADE: sparse lexical and expansion model for first stage ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2288–2292. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p1.1), [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1), [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1).
- L. Gao and J. Callan (2021) Condenser: a pre-training architecture for dense retrieval. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 981–993. Cited by: [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.14.6.1).
- L. Gao, Z. Dai, and J. Callan (2021) COIL: revisit exact lexical match in information retrieval with contextualized inverted list. arXiv preprint arXiv:2104.07186. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p2.1).
- Z. Geng, Y. Wang, D. Ru, and Y. Yang (2024) Towards competitive search relevance for inference-free learned sparse retrievers. arXiv preprint arXiv:2411.04403. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1), [§3.3](https://arxiv.org/html/2607.00004#S3.SS3.p1.4), [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1), [§6.5](https://arxiv.org/html/2607.00004#S6.SS5.p1.1), [Table 8](https://arxiv.org/html/2607.00004#S6.T8), [Table 8](https://arxiv.org/html/2607.00004#S6.T8.5.3.9.6.1).
- R. A. Horn and C. R. Johnson (2013) Matrix analysis. 2nd edition, Cambridge University Press. Cited by: [§3.2](https://arxiv.org/html/2607.00004#S3.SS2.SSS0.Px1.p1.8).
- S. M. Kakade, K. Sridharan, and A. Tewari (2008) On the complexity of linear prediction: risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), Vancouver, Canada, pp. 793–800. External Links: [Link](https://proceedings.neurips.cc/paper/2008/hash/5b69b9cb83065d403869739ae7f0995e-Abstract.html). Cited by: [§2.4](https://arxiv.org/html/2607.00004#S2.SS4.p1.2).
- V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020a) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6769–6781. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p1.1).
- V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020b) Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p1.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.13.5.1).
- A. S. Kasmaee, M. Khodadad, M. A. Saloot, N. Sherck, S. Dokas, H. Mahyar, and S. Samiee (2024) ChemTEB: chemical text embedding benchmark, an overview of embedding models performance & efficiency on a specific domain. arXiv preprint arXiv:2412.00532. Cited by: [§6.6](https://arxiv.org/html/2607.00004#S6.SS6.p2.1).
- H. Kim, T. K. Lee, and T. Won (2025) The role of vocabularies in learning sparse representations for ranking. arXiv preprint arXiv:2509.16621. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p2.1).
- T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959. Cited by: [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1).
- C. Lassance and S. Clinchant (2022) An efficiency study for SPLADE models. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2220–2226. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1), [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1).
- C. Lassance, H. Déjean, T. Formal, and S. Clinchant (2024) SPLADE-v3: new baselines for SPLADE. arXiv preprint arXiv:2403.06789. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1), [Table 1](https://arxiv.org/html/2607.00004#S4.T1.4.2.7.5.1), [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1), [§5.2](https://arxiv.org/html/2607.00004#S5.SS2.SSS0.Px2.p1.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.23.15.1), [Table 8](https://arxiv.org/html/2607.00004#S6.T8.5.3.6.3.1).
- M. Ledoux and M. Talagrand (1991) Probability in Banach spaces: isoperimetry and processes. Springer. Cited by: [Appendix B](https://arxiv.org/html/2607.00004#A2.1.p1.6).
- Y. Lei, T. Shen, Y. Cao, and A. Yates (2025) Enhancing lexicon-based text embeddings with large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 18986–19001. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p2.1).
- J. Lin and X. Ma (2021) A few brief notes on DeepImpact, COIL, and a conceptual framework for information retrieval techniques. arXiv preprint arXiv:2106.14807. Cited by: [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.16.8.1).
- E. G. Lionis, J. Ju, A. Nalmpantis, C. Thuis, S. MacAvaney, and A. Yates (2026) To case or not to case: an empirical study in learned sparse retrieval. In European Conference on Information Retrieval, pp. 512–528. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p2.1).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. External Links: [Link](https://arxiv.org/abs/1907.11692). Cited by: [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1).
- A. Mallia, O. Khattab, T. Suel, and N. Tonellotto (2021) Learning passage impacts for inverted indexes. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, pp. 1723–1727. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p1.1), [§5.2](https://arxiv.org/html/2607.00004#S5.SS2.SSS0.Px2.p1.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.17.9.1).
- C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p1.1), [§2.4](https://arxiv.org/html/2607.00004#S2.SS4.p1.2), [§3.1](https://arxiv.org/html/2607.00004#S3.SS1.SSS0.Px4.p1.6).
- A. Martins and R. Astudillo (2016) From softmax to sparsemax: a sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614–1623. Cited by: [§4.2.2](https://arxiv.org/html/2607.00004#S4.SS2.SSS2.p3.6).
- C. McDiarmid (1989) On the method of bounded differences. In Surveys in Combinatorics, Vol. 141, pp. 148–188. Cited by: [Appendix B](https://arxiv.org/html/2607.00004#A2.1.p1.6).
- B. Minixhofer, F. Paischer, and N. Rekabsaz (2022) WECHSEL: effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3992–4006. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p1.1).
- M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. 2nd edition, Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA. External Links: ISBN 978-0262039406, [Link](https://mitpress.mit.edu/9780262039406/foundations-of-machine-learning/). Cited by: [Appendix B](https://arxiv.org/html/2607.00004#A2.1.p1.6), [§3.2](https://arxiv.org/html/2607.00004#S3.SS2.SSS0.Px1.p1.7).
- N. Mundra, A. N. K. Khandavally, R. Dabre, R. Puduppully, A. Kunchukuttan, and M. M. Khapra (2024) An empirical comparison of vocabulary expansion and initialization approaches for language models. In Proceedings of the 28th Conference on Computational Natural Language Learning, pp. 84–104. Cited by: [§2.3](https://arxiv.org/html/2607.00004#S2.SS3.p1.1).
- F. M. Nardini, T. Nguyen, C. Rulli, R. Venturini, and A. Yates (2025) Effective inference-free retrieval for learned sparse representations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2936–2940. Cited by: [Table 8](https://arxiv.org/html/2607.00004#S6.T8.5.3.8.5.1).
- T. Nguyen, Y. Lei, J. Ju, E. Yang, and A. Yates (2025) Milco: learned sparse retrieval across languages via a multilingual connector. arXiv preprint arXiv:2510.00671. Cited by: [§6.3](https://arxiv.org/html/2607.00004#S6.SS3.p2.1).
- T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset. External Links: [Link](https://www.microsoft.com/en-us/research/publication/ms-marco-human-generated-machine-reading-comprehension-dataset/). Cited by: [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px1.p1.1).
- R. Nogueira, W. Yang, J. Lin, and K. Cho (2019) Document expansion by query prediction. arXiv preprint arXiv:1904.08375. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p2.1).
- B. Paria, C. Yeh, I. E. Yen, N. Xu, P. Ravikumar, and B. Póczos (2020) Minimizing FLOPs to learn efficient sparse representations. arXiv preprint arXiv:2004.05665. Cited by: [§5.3](https://arxiv.org/html/2607.00004#S5.SS3.SSS0.Px1.p2.1).
- A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32. Cited by: [§5.3](https://arxiv.org/html/2607.00004#S5.SS3.SSS0.Px3.p1.1).
- S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al. (1995) Okapi at TREC-3. British Library Research and Development Department. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p1.1).
- S. Robertson, H. Zaragoza, et al. (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3(4), pp. 333–389. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p1.1), [§2.4](https://arxiv.org/html/2607.00004#S2.SS4.p1.2).
- K. Santhanam, O. Khattab, J. Saad-Falcon, C. Potts, and M. Zaharia (2022) ColBERTv2: effective and efficient retrieval via lightweight late interaction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3715–3734. Cited by: [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.15.7.1).
- M. Schuster and K. Nakajima (2012) Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152. Cited by: [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1).
- R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Cited by: [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1).
- X. Shen, Z. Geng, and Y. Yang (2025) Exploring l0 sparsification for inference-free sparse retrievers. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2572–2576. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1), [§3.3](https://arxiv.org/html/2607.00004#S3.SS3.p1.4), [§6.5](https://arxiv.org/html/2607.00004#S6.SS5.p1.1), [§6.5](https://arxiv.org/html/2607.00004#S6.SS5.p3.1), [Table 8](https://arxiv.org/html/2607.00004#S6.T8), [Table 8](https://arxiv.org/html/2607.00004#S6.T8.4.2.2.1), [Table 8](https://arxiv.org/html/2607.00004#S6.T8.5.3.10.7.1).
- N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych (2021) BEIR: a heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663. Cited by: [3rd item](https://arxiv.org/html/2607.00004#S1.I1.i3.p1.1), [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px2.p1.1).
- N. Thakur, K. Wang, I. Gurevych, and J. Lin (2023) SPRINT: a unified toolkit for evaluating and demystifying zero-shot neural sparse retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2964–2974. Cited by: [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.16.8.1), [Table 2](https://arxiv.org/html/2607.00004#S6.T2.10.8.17.9.1).
- B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, et al. (2025) Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2526–2547. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p2.1), [§1](https://arxiv.org/html/2607.00004#S1.p5.1), [§2.2](https://arxiv.org/html/2607.00004#S2.SS2.p1.1), [§5.2](https://arxiv.org/html/2607.00004#S5.SS2.SSS0.Px1.p1.1).
- T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. (2019) HuggingFace’s Transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: [§5.3](https://arxiv.org/html/2607.00004#S5.SS3.SSS0.Px3.p1.1).
- L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk (2020) Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p1.1).
- L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. N. Bennett, J. Ahmed, and A. Overwijk (2021) Approximate nearest neighbor negative contrastive learning for dense text retrieval. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Note: arXiv:2007.00808. Cited by: [§1](https://arxiv.org/html/2607.00004#S1.p1.1).
- Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pp. 19–27. Cited by: [§5.1](https://arxiv.org/html/2607.00004#S5.SS1.SSS0.Px1.p1.1).
- S. Zhuang and G. Zuccon (2021) TILDE: term independent likelihood model for passage re-ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1483–1492. Cited by: [§2.1](https://arxiv.org/html/2607.00004#S2.SS1.p3.1).

## Appendix A. Per-Seed Results for ModernBERT-VT

Table 10. Per-seed results for ModernBERT-VT. We report BEIR nDCG@10, MS MARCO MRR@10 / R@1k, and TREC DL-19 nDCG@10 / R@1k. The last two rows summarize mean and standard deviation over the five seeds.

To support the mean ± std statistics reported for ModernBERT-VT in Table [2](https://arxiv.org/html/2607.00004#S6.T2), we list the raw per-seed results in Table [10](https://arxiv.org/html/2607.00004#A1.T10). All five runs share identical training configurations, differing only in the random seed used for data shuffling and parameter initialization of non-transferred components. The seed originally reported in the main text is 42; the additional four seeds (1, 2, 3, 4) were run post-hoc to quantify variance.
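As an illustration of how per-seed scores reduce to a mean ± std summary, here is a minimal sketch. The scores below are made-up placeholders, not the paper's results, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which estimator is used.

```python
import statistics
from typing import Dict, Tuple


def summarize_seeds(scores_by_seed: Dict[int, float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    values = list(scores_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    return statistics.mean(values), statistics.stdev(values)


# Placeholder scores keyed by seed; these are NOT values from Table 10.
placeholder_scores = {42: 0.50, 1: 0.52, 2: 0.51, 3: 0.49, 4: 0.53}
mean, std = summarize_seeds(placeholder_scores)
print(f"{mean:.3f} ± {std:.3f}")
```

With only five seeds the choice between the sample and population estimator changes the reported spread noticeably, so it is worth stating which one is used.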

## Appendix B. Proof of Theorem [3.2](https://arxiv.org/html/2607.00004#S3.Thmtheorem2)

For a score class $\mathcal{F}$, write the empirical Rademacher complexity $\hat{\mathfrak{R}}_{n}(\mathcal{F})=\mathbb{E}_{\sigma}\big[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(z_{i})\big]$.
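For intuition, the empirical Rademacher complexity of a finite score class can be estimated by Monte Carlo sampling of the sign vector $\sigma$. The sketch below assumes the class is given explicitly as a matrix of scores `scores[k][i]` $= f_k(z_i)$; the function name and the toy classes are hypothetical and only illustrate the definition, not the hypothesis classes analyzed in the paper.

```python
import random
from typing import Sequence


def empirical_rademacher(scores: Sequence[Sequence[float]],
                         n_draws: int = 2000,
                         seed: int = 0) -> float:
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ].

    scores[k][i] holds f_k(z_i) for a finite class {f_1, ..., f_K}.
    """
    if not scores or not scores[0]:
        raise ValueError("scores must be a non-empty K x n matrix")
    rng = random.Random(seed)
    n = len(scores[0])
    total = 0.0
    for _ in range(n_draws):
        # Draw independent Rademacher signs in {-1, +1}.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        # Supremum over the (finite) class of the signed empirical average.
        total += max(sum(s * v for s, v in zip(sigma, row)) / n for row in scores)
    return total / n_draws


# Toy classes on n = 4 sample points: the larger class contains the smaller one.
small_class = [[1.0, 0.0, 1.0, 0.0]]
large_class = small_class + [[-1.0, 0.0, -1.0, 0.0]]
r_small = empirical_rademacher(small_class)
r_large = empirical_rademacher(large_class)
print(r_small, r_large)
```

Because both calls use the same seed and sample size, they see the same sign draws, so the estimate for the larger class can never fall below that of the class it contains. This mirrors the monotonicity that makes a reduced hypothesis class attractive in the bound.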

###### Lemma B.1 (Uniform generalization for bounded $L$-Lipschitz losses).

Let $\ell:\mathbb{R}\to[0,1]$ be $L$-Lipschitz and $\mathcal{F}$ be any real-valued score class. Then for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}}\Big(\mathcal{L}(f)-\hat{\mathcal{L}}_{n}(f)\Big)\ \leq\ 2L\,\hat{\mathfrak{R}}_{n}(\mathcal{F})\ +\ C_{\mathrm{gen}}\sqrt{\tfrac{\log(1/\delta)}{n}},$$

for a universal constant $C_{\mathrm{gen}}>0$.

###### Proof\.

Let $\mathcal{G}=\{\ell\circ f:f\in\mathcal{F}\}\subset[0,1]$. Standard symmetrization gives $\mathbb{E}\big[\sup_{g\in\mathcal{G}}(\mathbb{P}g-\hat{\mathbb{P}}g)\big]\leq 2\hat{\mathfrak{R}}_{n}(\mathcal{G})$, and Ledoux–Talagrand contraction yields $\hat{\mathfrak{R}}_{n}(\mathcal{G})\leq L\,\hat{\mathfrak{R}}_{n}(\mathcal{F})$ (e.g., (Ledoux and Talagrand, [1991](https://arxiv.org/html/2607.00004#bib.bib19); Mohri et al., [2018](https://arxiv.org/html/2607.00004#bib.bib15))). Since $\mathcal{G}\subset[0,1]$, changing one sample changes the supremum by at most $1/n$, so McDiarmid’s inequality converts the expectation bound to the stated high-probability form, absorbing numerical constants into $C_{\mathrm{gen}}$ (McDiarmid, [1989](https://arxiv.org/html/2607.00004#bib.bib21)). ∎

###### Lemma B.2 (RC transfer between $V$ and $V'$).

Under ([RC⋆](https://arxiv.org/html/2607.00004#S3.Ex3)) and $L$-Lipschitz $\ell$,

$$\inf_{h\in\mathcal{H}_{V'}}\mathcal{L}(h)\ \leq\ \inf_{h\in\mathcal{H}_{V}}\mathcal{L}(h)\ +\ L\,\varepsilon_{\mathrm{RC}}.$$

###### Proof\.

Fix $\beta\geq 0$ with $\|\beta\|_{1}\leq B$. By ([RC⋆](https://arxiv.org/html/2607.00004#S3.Ex3)), there exists $\beta'\geq 0$, $\|\beta'\|_{1}\leq B$ such that $\sup_{(q,d)}|\langle\beta,u_{\theta}(q,d)\rangle-\langle\beta',u'_{\theta}(q,d)\rangle|\leq\varepsilon_{\mathrm{RC}}$. Applying $L$-Lipschitz $\ell$, taking expectations, and then taking infima over feasible $\beta$ proves the claim. ∎
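The three steps named in the proof of Lemma B.2 can be written out as follows. This is a sketch under assumed notation that is not spelled out in this appendix: $h_{\beta}$ denotes the scorer $(q,d)\mapsto\langle\beta,u_{\theta}(q,d)\rangle$ in $\mathcal{H}_{V}$, $h_{\beta'}$ the matching scorer $(q,d)\mapsto\langle\beta',u'_{\theta}(q,d)\rangle$ in $\mathcal{H}_{V'}$, and $\mathcal{L}(h)$ is taken to be the expectation of $\ell$ applied to the score.

```latex
% Assumed shorthand: s = <beta, u_theta(q,d)>,  s' = <beta', u'_theta(q,d)>.
\begin{align*}
\ell(s') &\le \ell(s) + L\,|s' - s| \le \ell(s) + L\,\varepsilon_{\mathrm{RC}}
  && \text{($L$-Lipschitz $\ell$, then RC$^\star$)} \\
\mathcal{L}(h_{\beta'}) &\le \mathcal{L}(h_{\beta}) + L\,\varepsilon_{\mathrm{RC}}
  && \text{(taking expectations)} \\
\inf_{h\in\mathcal{H}_{V'}}\mathcal{L}(h) &\le \mathcal{L}(h_{\beta'})
  \le \mathcal{L}(h_{\beta}) + L\,\varepsilon_{\mathrm{RC}}
  && \text{(since $h_{\beta'}\in\mathcal{H}_{V'}$)}
\end{align*}
```

The left-hand side of the last line does not depend on $\beta$, so taking the infimum over feasible $\beta$ on the right gives the stated inequality.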

##### Proof of Theorem [3.2](https://arxiv.org/html/2607.00004#S3.Thmtheorem2).

Let $\hat{h}_{G}\in\mathcal{H}_{V'}$ be an ERM and let $h^{\star}\in\arg\min_{h\in\mathcal{H}_{V'}}\mathcal{L}(h)$. By ERM optimality,

$$\begin{aligned}
\mathcal{L}(\hat{h}_{G})-\mathcal{L}(h^{\star})\;&\leq\;\big(\mathcal{L}(\hat{h}_{G})-\hat{\mathcal{L}}_{n}(\hat{h}_{G})\big)+\big(\hat{\mathcal{L}}_{n}(h^{\star})-\mathcal{L}(h^{\star})\big)\\
&\leq\;2\sup_{h\in\mathcal{H}_{V'}}\big(\mathcal{L}(h)-\hat{\mathcal{L}}_{n}(h)\big).
\end{aligned}\tag{11}$$

Applying Lemma [B.1](https://arxiv.org/html/2607.00004#A2.Thmtheorem1) to $\mathcal{H}_{V'}$ and using ([3](https://arxiv.org/html/2607.00004#S3.E3)) gives, with probability at least $1-\delta$,

$$\begin{aligned}
\mathcal{L}(\hat{h}_{G})-\inf_{h\in\mathcal{H}_{V'}}\mathcal{L}(h)&\leq 4L\,\hat{\mathfrak{R}}_{n}(\mathcal{H}_{V'})+2C_{\mathrm{gen}}\sqrt{\tfrac{\log(1/\delta)}{n}}\\
&\leq 4BL\,\hat{\mathfrak{R}}_{n}(\mathcal{W}_{V'};\|\cdot\|_{\infty})+2C_{\mathrm{gen}}\sqrt{\tfrac{\log(1/\delta)}{n}}.
\end{aligned}\tag{12}$$

Finally, Lemma [B.2](https://arxiv.org/html/2607.00004#A2.Thmtheorem2) yields $\inf_{h\in\mathcal{H}_{V'}}\mathcal{L}(h)\leq\inf_{h\in\mathcal{H}_{V}}\mathcal{L}(h)+L\varepsilon_{\mathrm{RC}}$, and combining with ([12](https://arxiv.org/html/2607.00004#A2.E12)) proves ([6](https://arxiv.org/html/2607.00004#S3.E6)). ∎
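For readers who want the "ERM optimality" step behind the first inequality of (11) made explicit, the usual three-term decomposition is sketched below. It uses only the symbols of the proof and assumes, as the proof does, that $\hat{h}_{G}$ minimizes the empirical risk $\hat{\mathcal{L}}_{n}$ over $\mathcal{H}_{V'}$.

```latex
\begin{align*}
\mathcal{L}(\hat{h}_{G}) - \mathcal{L}(h^{\star})
 &= \big(\mathcal{L}(\hat{h}_{G}) - \hat{\mathcal{L}}_{n}(\hat{h}_{G})\big)
  + \underbrace{\big(\hat{\mathcal{L}}_{n}(\hat{h}_{G}) - \hat{\mathcal{L}}_{n}(h^{\star})\big)}_{\le\, 0 \text{ because } \hat{h}_{G} \text{ is an ERM}}
  + \big(\hat{\mathcal{L}}_{n}(h^{\star}) - \mathcal{L}(h^{\star})\big) \\
 &\le \big(\mathcal{L}(\hat{h}_{G}) - \hat{\mathcal{L}}_{n}(\hat{h}_{G})\big)
  + \big(\hat{\mathcal{L}}_{n}(h^{\star}) - \mathcal{L}(h^{\star})\big).
\end{align*}
```

Dropping the non-positive middle term is what leaves the two deviation terms that (11) then bounds uniformly over $\mathcal{H}_{V'}$.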
