m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
Summary
This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.
# m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
Source: [https://arxiv.org/html/2605.19568](https://arxiv.org/html/2605.19568)
###### Abstract\.
Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining process: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models on Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
information retrieval, language model, web search
CCS Concepts: Information systems (Language models; Web and social media search)
## 1. Introduction
The embedding model is a fundamental component in information retrieval, especially encoder-only transformer architectures (Patil et al., [2023](https://arxiv.org/html/2605.19568#bib.bib29)), which are widely used in commercial applications such as search engines, advertising platforms, retrieval-augmented generation (Xiang et al., [2025](https://arxiv.org/html/2605.19568#bib.bib33)), and agent systems (Gao et al., [2025](https://arxiv.org/html/2605.19568#bib.bib34)). In these domains, pretraining on large-scale datasets followed by supervised fine-tuning with domain-specific data has proven to be an effective approach for building robust embedding models.
Figure 1. Illustrative curves showing the diminishing returns of retrieval performance (Recall@100) with increasing (a) embedding dimension and (b) number of transformer layers for an mBERT model on Bing-Click. While performance generally improves, the gains diminish significantly at higher dimensions/depths, whereas computational costs (e.g., forward latency, embedding memory) often scale linearly.
Existing pretrained embedding models (Kenton and Toutanova, [2019](https://arxiv.org/html/2605.19568#bib.bib1); Liu et al., [2019](https://arxiv.org/html/2605.19568#bib.bib2); Wang et al., [2022](https://arxiv.org/html/2605.19568#bib.bib3); Warner et al., [2024](https://arxiv.org/html/2605.19568#bib.bib5)) provide powerful general-purpose embeddings and often incorporate training strategies and architectural innovations to enhance performance. However, a significant limitation of these models lies in their rigid design: both the model size and the dimension of the generated embeddings are fixed. This rigidity makes it challenging to meet the diverse requirements of downstream tasks, which may demand varying trade-offs between retrieval performance, latency, and computational resources. Figure [1](https://arxiv.org/html/2605.19568#S1.F1) illustrates this challenge: while increasing embedding dimension or model depth (number of layers) generally improves retrieval performance, the gains often diminish significantly at higher values. For instance, doubling the embedding dimension from 64 to 128 might yield a substantial performance boost, but further increasing it to 768 may offer only marginal improvements while linearly increasing retrieval latency and storage costs.
This pattern of diminishing returns underscores the inefficiency of a one-size-fits-all approach in embedding model design. Tasks with strict latency constraints or those deployed on resource-constrained hardware may thus require smaller models and lower-dimensional embeddings to operate efficiently, even if it means a slight compromise on performance. To obtain these smaller variants from an existing large pre-trained model, a common practice is to construct a smaller architecture and initialize its parameters by copying the corresponding weights from the large model. However, this partial parameter initialization can fail to fully leverage the benefits of pre-training because of the misalignment between pretraining and downstream fine-tuning, thereby limiting the performance gains.
To address this challenge, we draw inspiration from Matryoshka Representation Learning (MRL) (Kusupati et al., [2022](https://arxiv.org/html/2605.19568#bib.bib6)), which enables flexibility in embedding representations by training them at multiple dimensions. Extending this concept, we propose a novel pre-training strategy that incorporates multigranular embedding representations across both transformer layers and embedding dimensions. By jointly optimizing diverse embeddings along these two axes during pretraining, our approach equips the model with flexible and robust representations, making it a stronger foundation for fine-tuning across a broad range of downstream tasks requiring different model sizes and embedding dimensions.
Building upon this strategy, we present m3BERT, a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, designed for flexible adaptation to varying deployment scenarios. While current widely used embedding models (Wang et al., [2022](https://arxiv.org/html/2605.19568#bib.bib3); Multi-Granularity, [2024](https://arxiv.org/html/2605.19568#bib.bib19); [OrdalieTech,](https://arxiv.org/html/2605.19568#bib.bib20)) predominantly adopt the traditional BERT architecture, recent advancements in language model design (Touvron et al., [2023](https://arxiv.org/html/2605.19568#bib.bib8); Jung et al., [2010](https://arxiv.org/html/2605.19568#bib.bib18); [Team et al.,](https://arxiv.org/html/2605.19568#bib.bib17)) have indicated potential sub-optimality in certain aspects of this structure. Consequently, we incorporate recent architectural improvements from large language models into m3BERT. Our pre-training process comprises three distinct stages: monolingual pretraining, multilingual pretraining, and continual pretraining on a large-scale web domain corpus. This three-stage approach is specifically designed to enhance the model’s performance in real-world retrieval tasks.
Extensive experiments demonstrate the effectiveness and robustness of our approach. On our collected large-scale industrial dataset Bing-Click, m3BERT consistently outperforms current state-of-the-art embedding models across various model sizes and embedding dimensions. To further validate the generalizability of our matryoshka pre-training strategy, we conduct evaluations on multiple public datasets, showing that its benefits extend beyond our specific industrial use case. Additionally, we explore a novel self-distillation technique, termed Matryoshka Distillation, which leverages the nested structure of our model to further enhance the performance of smaller embeddings during the pre-training phase.
The effectiveness of m3BERT is validated by its large-scale deployment at Bing Search. Since June 2025, m3BERT has been a core component of the production query-keyword selection workflow, consistently handling over 25,000 queries per second (QPS). This deployment, which contributes an annualized revenue impact of approximately USD 50 million, demonstrates the tangible business value and engineering success of our model design.
Our main contributions can be summarized as follows:
- We introduce the Matryoshka Representation Learning paradigm into the pretraining of embedding models and extend it to encompass multigranular representations across both transformer layers and embedding dimensions. This innovation enables a single pretrained model to effectively align with varying deployment scenarios.
- We pretrain m3BERT, a novel model that incorporates modern architectural advancements. We leverage a massive, multi-stage pre-training process, including a large-scale web domain corpus, to specifically tailor the model for superior performance in industrial retrieval applications.
- Our model not only outperforms state-of-the-art baselines in extensive offline experiments but has also been proven under live traffic at Bing, delivering an annualized revenue impact of approximately USD 50 million.
## 2. m3BERT
Figure 2. Overview of the matryoshka model structure using masked language modeling (MLM) as the training objective. The model simultaneously optimizes embeddings across multiple transformer layers and multiple sub-dimensions.
### 2.1. Matryoshka Pretraining
To enable flexible and efficient embedding across various downstream tasks, we propose a multigranular embedding pretraining strategy\. Unlike traditional pre\-training approaches that rely solely on the final layer and full\-dimensional embeddings, our method jointly optimizes embeddings across multiple transformer layers and multiple sub\-dimensions\.
Formally, let a transformer encoder have $N$ total layers, producing hidden states $\{h^{k}\}_{k=1}^{N}$. Each full hidden state from a layer $k$ is $h^{k} \in \mathbb{R}^{s \times M}$, where $s$ is the sequence length and $M$ is the full embedding dimension. To accommodate diverse deployment scenarios, we predefine a set of selected layer indices $L \subseteq \{1, \ldots, N\}$ and a set of target sub-dimensions $D \subseteq \{1, \ldots, M\}$. For each selected layer $l_i \in L$ and each sub-dimension $d_j \in D$, we extract the truncated embedding from the output of layer $l_i$:

$$h^{l_i}_{[:d_j]} \in \mathbb{R}^{s \times d_j}$$

where $h^{l_i}_{[:d_j]}$ denotes taking the first $d_j$ dimensions of $h^{l_i}$ (the full $M$-dimensional hidden state output by layer $l_i$). This process results in $|L| \times |D|$ distinct embedding representations that are jointly optimized during training.
For masked language modeling (MLM) training, we reuse a shared MLM head. The MLM head consists of a projection matrix $W \in \mathbb{R}^{M \times V}$ and a bias $b \in \mathbb{R}^{V}$, where $V$ is the vocabulary size. When using a truncated embedding $h^{l_i}_{[:d_j]}$, we utilize the corresponding submatrix $W_{[:d_j,:]} \in \mathbb{R}^{d_j \times V}$ to project the embedding into the vocabulary space:

$$\hat{y}^{l_i,d_j} = \text{Softmax}\left(h^{l_i}_{[:d_j]} W_{[:d_j,:]} + b\right)$$

The total loss is computed by aggregating the MLM losses across all selected (layer, sub-dimension) pairs:

$$\mathcal{L}_{\text{total}} = \sum_{l_i \in L} \sum_{d_j \in D} \mathcal{L}_{\text{MLM}}\left(\hat{y}^{l_i,d_j}, y\right)$$

where $y$ denotes the ground-truth tokens at masked positions.
By optimizing $\mathcal{L}_{\text{total}}$, the model learns robust and flexible representations that are consistent and predictive across multiple granularities of layers and embedding dimensions. This design ensures that practitioners can dynamically select different model depths and embedding widths to balance the trade-offs between latency, memory, and performance.
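The objective above can be sketched in plain Python for a single masked position. This is a minimal illustration rather than the authors' implementation: the function name `matryoshka_mlm_loss`, the toy sizes, and the list-based tensors are assumptions chosen for readability, and a practical version would batch the computation on an accelerator.

```python
import math

def matryoshka_mlm_loss(hidden_states, W, b, target, layers, dims):
    """Sum of MLM cross-entropies over all (layer, sub-dimension) pairs.

    hidden_states: dict mapping a layer index to that layer's M-dim vector
                   at one masked position (hypothetical toy layout).
    W: shared M x V projection as a list of rows; b: list of V biases.
    target: id of the ground-truth token at the masked position.
    """
    vocab = len(b)
    total = 0.0
    for layer in layers:
        h = hidden_states[layer]
        for d in dims:
            # Truncated embedding h[:d] times the submatrix W[:d, :], plus bias.
            logits = [sum(h[k] * W[k][v] for k in range(d)) + b[v]
                      for v in range(vocab)]
            # Numerically stable -log softmax(logits)[target].
            top = max(logits)
            log_z = top + math.log(sum(math.exp(x - top) for x in logits))
            total += log_z - logits[target]
    return total

# Toy setup (assumed sizes): M = 8, V = 5, two supervised layers, three widths.
M, V = 8, 5
hidden = {2: [0.1 * k for k in range(M)], 4: [0.2 * k for k in range(M)]}
W = [[0.01 * (k + v) for v in range(V)] for k in range(M)]
b = [0.0] * V
loss = matryoshka_mlm_loss(hidden, W, b, target=3, layers=[2, 4], dims=[2, 4, 8])
```

A quick sanity check: with an all-zero projection every pair predicts a uniform distribution, so the total equals $|L| \times |D| \times \log V$.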
### 2.2. Modern Architecture
Building upon the original BERT architecture, we incorporate several recent advancements from large language models \(LLMs\) to enhance training and inference efficiency, improve training stability, and boost overall model performance\.
#### Activation Function\.
We replace the GeLU activation of BERT with SwiGLU (Shazeer, [2020](https://arxiv.org/html/2605.19568#bib.bib7)), which has demonstrated better performance in transformer-based models (Warner et al., [2024](https://arxiv.org/html/2605.19568#bib.bib5); Touvron et al., [2023](https://arxiv.org/html/2605.19568#bib.bib8); Groeneveld et al., [2024](https://arxiv.org/html/2605.19568#bib.bib9)). The SwiGLU activation introduces a gated mechanism, providing stronger non-linearity and facilitating more effective feature learning.
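As a rough sketch of the gated feed-forward block this implies, the snippet below follows the SwiGLU form from Shazeer (2020), $\text{FFN}(x) = (\text{Swish}(xW_{\text{gate}}) \odot xW_{\text{up}})\,W_{\text{down}}$, without bias terms. The weight names and the plain-list matrix helper are illustrative assumptions, not identifiers from the paper.

```python
import math

def silu(x):
    """Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(x, W):
    """Multiply a length-n vector by an n x m matrix stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Bias-free gated feed-forward block (hypothetical weight names)."""
    gate = [silu(g) for g in matvec(x, W_gate)]   # non-linear gate branch
    up = matvec(x, W_up)                          # linear value branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(hidden, W_down)
```

Compared with a GeLU feed-forward layer, the block carries a third weight matrix, which is why SwiGLU models usually shrink the hidden width to keep the parameter count comparable.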
#### Normalization\.
Following recent trends in LLMs (Touvron et al., [2023](https://arxiv.org/html/2605.19568#bib.bib8); Bai et al., [2023](https://arxiv.org/html/2605.19568#bib.bib10)), we adopt root mean square layer normalization (RMSNorm) (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.19568#bib.bib11)) instead of the standard LayerNorm. Additionally, we transition from a post-norm configuration to a pre-norm design, which stabilizes training and improves gradient flow in deep transformers (Wang et al., [2019](https://arxiv.org/html/2605.19568#bib.bib12)).
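The two changes can be illustrated together. The sketch below assumes the usual RMSNorm definition (divide by the root mean square, then apply a learned gain) and a pre-norm residual of the form $x + f(\text{Norm}(x))$; the `eps` value and the function names are assumptions for illustration only.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by its root mean square; no mean subtraction, unlike LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_residual(x, sublayer, gain):
    """Pre-norm: normalize the sublayer input, then add the untouched residual."""
    normed = rms_norm(x, gain)
    return [a + b for a, b in zip(x, sublayer(normed))]
```

In the post-norm layout the normalization would instead wrap the sum, `norm(x + sublayer(x))`, so the residual path itself is rescaled at every layer.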
#### Bias Terms and Dropout\.
Recent studies (Warner et al., [2024](https://arxiv.org/html/2605.19568#bib.bib5); Touvron et al., [2023](https://arxiv.org/html/2605.19568#bib.bib8)) indicate that bias terms contribute little to model performance while introducing unnecessary computational overhead. We remove the bias terms in both the self-attention and feed-forward layers. Furthermore, we eliminate dropout applied to the hidden states between transformer layers, as its removal has been shown to maintain performance while simplifying the model and accelerating training.
#### Flash Attention\.
We replace the standard attention computation with FlashAttention (Dao et al., [2022](https://arxiv.org/html/2605.19568#bib.bib13); Dao, [2023](https://arxiv.org/html/2605.19568#bib.bib14)), an efficient algorithm that leverages tiling and memory-efficient strategies to significantly accelerate attention operations without compromising numerical precision.
By integrating these architectural refinements, our model inherits the proven strengths of BERT while benefiting from the efficiency and scalability improvements pioneered in recent LLMs\.
### 2.3. Three-Stage Pretraining
To equip the model with broad linguistic understanding, cross\-lingual capabilities, and domain\-specific knowledge, we adopt a three\-stage pretraining strategy\.
#### Stage 1: Monolingual Pretraining\.
We first conduct pretraining on a large English corpus consisting of 100 billion tokens from Nemotron-CC (Su et al., [2024](https://arxiv.org/html/2605.19568#bib.bib15)) using the masked language modeling (MLM) objective. Although previous work (Wettig et al., [2022](https://arxiv.org/html/2605.19568#bib.bib31)) suggests that increasing the masking rate to 0.3 can benefit the training of larger models, we observe that in our setting, which emphasizes learning from reduced layers and sub-dimensional embeddings, a high masking rate leads to unstable training and convergence issues. Therefore, we adopt a standard masking rate of 0.15 to balance learning difficulty and training stability.
#### Stage 2: Multilingual Pretraining\.
To extend the model’s capabilities to multilingual scenarios, we continue pretraining on a corpus constructed from the Wikipedia dumps of the top 100 languages, totaling approximately 20 billion tokens. Following the general practice introduced by multilingual BERT (Kenton and Toutanova, [2019](https://arxiv.org/html/2605.19568#bib.bib1)), we apply an exponential smoothing strategy to balance the contribution of different languages. Specifically, if $P(L)$ denotes the original data proportion for language $L$, the sampling probability is adjusted as $P'(L) \propto P(L)^{S}$ with a smoothing factor $S = 0.7$, followed by normalization. This approach mitigates the imbalance between high-resource and low-resource languages by relatively downsampling dominant languages like English and upsampling underrepresented languages.
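The smoothing rule is short enough to write out. In the sketch below the function name and the two-language token counts are hypothetical; only the exponent $S = 0.7$ and the normalization step come from the text.

```python
def smoothed_sampling_probs(token_counts, s=0.7):
    """Return P'(L) proportional to P(L)**s, renormalized to sum to 1."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** s for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Hypothetical corpus: 90% English, 10% Swahili by token count.
probs = smoothed_sampling_probs({"en": 900, "sw": 100})
```

Because $0 < S < 1$ compresses the ratio between proportions, the dominant language ends up sampled less often than its raw share and the minority language more often; setting `s=1.0` recovers the original proportions.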
#### Stage 3: Web Domain Pretraining\.
To specialize the model for ads and web search-related applications, we adapt our multigranular pretraining strategy using an infinite contrastive learning objective (Inf-CL) (Cheng et al., [2024](https://arxiv.org/html/2605.19568#bib.bib16)) on in-domain data. Inf-CL facilitates training with exceptionally large batch sizes by utilizing a tile-based computation strategy. This approach partitions the contrastive loss calculation, thereby avoiding the full materialization of the similarity matrix and overcoming typical GPU memory constraints associated with scaling contrastive learning. The training corpus consists of 10 billion query-document (Q-Doc) pairs collected from one month of user-ads click logs. Each pair includes ad features (e.g., keyword, ad titles, and ad descriptions) and URL features (e.g., landing page titles and contents). We apply extensive data cleaning, including Q-Doc deduplication and query-level sampling caps. Pretraining continues from the Stage 2 checkpoint under the same multigranular learning settings, with batch sizes scaled up to one million samples. The model is trained for 2 trillion tokens over the corpus.
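The tiling idea can be shown with a forward-pass-only sketch: each query's log-sum-exp over all documents is accumulated one tile at a time, so only a tile of logits exists at any moment. This is not the Inf-CL implementation, which also tiles across devices and handles the backward pass; the function name, the `tile` size, and the temperature `tau=0.05` (borrowed from the fine-tuning setup, not stated for Stage 3) are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tiled_contrastive_loss(queries, docs, tau=0.05, tile=2):
    """Mean InfoNCE loss where queries[i] matches docs[i].

    Logits are visited tile by tile with an online log-sum-exp, so the
    full n x n similarity matrix is never materialized.
    """
    n = len(queries)
    total = 0.0
    for i, q in enumerate(queries):
        running_max, running_sum = -math.inf, 0.0
        for start in range(0, n, tile):
            logits = [dot(q, d) / tau for d in docs[start:start + tile]]
            new_max = max(running_max, max(logits))
            # Rescale the partial sum to the new maximum before adding the tile.
            running_sum = (running_sum * math.exp(running_max - new_max)
                           + sum(math.exp(x - new_max) for x in logits))
            running_max = new_max
        log_z = running_max + math.log(running_sum)
        total += log_z - dot(q, docs[i]) / tau
    return total / n
```

The result is independent of the tile size, which is what lets the batch grow without the memory cost of the full similarity matrix.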
#### Training Details\.
For our model size, we adopt the same configuration as mBERT-base ([https://huggingface.co/google-bert/bert-base-multilingual-uncased](https://huggingface.co/google-bert/bert-base-multilingual-uncased)) and mE5-base ([https://huggingface.co/intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)), utilizing 12 transformer layers and a hidden size of 768. In Stage 1 and Stage 2, we adopt masked language modeling (MLM) as the training objective. For the multigranular optimization, we select a layer set $L = \{4, 8, 12\}$ and a dimension set $D = \{32, 64, 128, 768\}$. These granularities are specifically chosen to align with the hierarchical deployment tiers and hardware constraints in Bing’s production infrastructure, where different retrieval stages must adhere to latency budgets and memory limits. The learning rate was set to $1 \times 10^{-4}$ with a batch size of 128K sequences, each with a maximum length of 512 tokens. In Stage 3, we switch to the infinite contrastive learning (Inf-CL) objective to better align with the web domain data, increasing the batch size to 1 million samples and setting the learning rate to $2 \times 10^{-4}$. For this stage, considering computational efficiency and practical application needs, the multigranular optimization utilizes a layer set $L = \{4, 12\}$ and a dimension set $D = \{32, 64, 128\}$. Across all stages, we employ the AdamW optimizer (Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.19568#bib.bib30)) with $\beta_1 = 0.9$, $\beta_2 = 0.999$, a weight decay of 0.01, and a cosine learning rate decay schedule with 10,000 warm-up steps. To enhance computational efficiency, bfloat16 mixed-precision training was utilized. The entire pretraining process was conducted on a cluster of NVIDIA A100 GPUs, accumulating approximately 40,000 GPU hours.
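A schedule of this shape could look as follows. The peak learning rate of $1 \times 10^{-4}$ and the 10,000 warm-up steps come from the text; the linear warm-up shape, the total of 100,000 steps, and the decay to zero are assumptions, since the paper does not state them here.

```python
import math

def lr_at(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=100_000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For Stage 3 the same function would be called with `peak_lr=2e-4`, again with a hypothetical `total_steps`.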
## 3. Experiments
This section presents a comprehensive empirical evaluation of our proposed framework. In Section [3.1](https://arxiv.org/html/2605.19568#S3.SS1), we detail the experimental setup, including our large-scale industrial dataset Bing-Click. Section [3.2](https://arxiv.org/html/2605.19568#S3.SS2) showcases the main results on Bing-Click, demonstrating that m3BERT achieves state-of-the-art performance in this industrial retrieval scenario. To validate the core of our method, Section [3.3](https://arxiv.org/html/2605.19568#S3.SS3) isolates the impact of the multigranular matryoshka pretraining strategy by comparing it against standard pretraining across multiple datasets. Finally, Section [3.4](https://arxiv.org/html/2605.19568#S3.SS4) discusses the successful deployment of our model, highlighting its operational scale and significant business impact.
Table 1. Main retrieval performance (Recall@100 and Recall@1000) on the Bing-Click test set. Results are grouped by model configuration (Lite or Full layers). Best performing figures for each metric and configuration are in **bold**. m3BERT-s3 is not evaluated at 768 dim due to its matryoshka pretraining focus up to 128 dim.

| Configuration | Model | R@100 (Dim=32) | R@1000 (Dim=32) | R@100 (Dim=64) | R@1000 (Dim=64) | R@100 (Dim=128) | R@1000 (Dim=128) | R@100 (Dim=768) | R@1000 (Dim=768) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lite (1/3 layers) | mBERT | 57.46 | 80.78 | 62.50 | 84.67 | 64.58 | 86.27 | 65.39 | 86.99 |
| Lite (1/3 layers) | mE5 | 57.18 | 80.63 | 62.16 | 84.40 | 64.40 | 86.14 | 65.39 | 86.90 |
| Lite (1/3 layers) | ModernBERT | 57.10 | 80.55 | 62.43 | 84.59 | 64.59 | 86.23 | 65.34 | 86.87 |
| Lite (1/3 layers) | m3BERT-s2 | 58.34 | 81.67 | 63.43 | 85.49 | 65.47 | 87.02 | **66.33** | **87.66** |
| Lite (1/3 layers) | m3BERT-s3 | **62.06** | **84.52** | **66.47** | **87.67** | **67.72** | **88.51** | – | – |
| Full | mBERT | 59.74 | 82.83 | 64.63 | 86.28 | 66.76 | 87.69 | 67.33 | 88.12 |
| Full | mE5 | 59.31 | 82.24 | 64.32 | 85.98 | 66.85 | 87.39 | 67.97 | 88.49 |
| Full | ModernBERT | 59.44 | 82.65 | 64.62 | 86.27 | 66.75 | 87.71 | 67.50 | 88.25 |
| Full | m3BERT-s2 | 60.25 | 83.29 | 65.44 | 87.00 | 67.34 | 88.37 | **68.32** | **89.00** |
| Full | m3BERT-s3 | **63.30** | **85.50** | **67.60** | **88.35** | **68.94** | **89.23** | – | – |

### 3.1. Setup
#### Datasets
To evaluate the effectiveness of m3BERT in real-world industrial retrieval scenarios, we collected and curated a large-scale multilingual industrial text retrieval dataset, which we name Bing-Click. This dataset comprises a substantial volume of user queries and their corresponding clicked webpages, gathered from the Bing search engine across diverse languages and regions worldwide. Specifically, a page is considered a positive instance for a given user query if the user clicked on it and subsequently remained for ten seconds, indicating relevance. We amassed six months of such interaction data, resulting in 100 million query-page pairs for training after filtering. For evaluation, we constructed a test set using the same methodology over a further three months, ensuring that the click timestamps for the test data chronologically followed those of the training data to prevent data leakage and simulate a realistic deployment scenario. The test set is designed to emulate a large-scale retrieval system, consisting of a candidate pool of 10 million unique documents and 1 million positive query-document pairs. This massive scale and the temporal separation provide a high-fidelity proxy for the model’s robustness against evolving user trends in an online environment. To further validate the generalizability of matryoshka pretraining beyond the industrial setting, we also conduct experiments on three public benchmarks: MS MARCO document ranking (Bajaj et al., [2018](https://arxiv.org/html/2605.19568#bib.bib47)), Natural Questions (Kwiatkowski et al., [2019](https://arxiv.org/html/2605.19568#bib.bib48)), and TREC-COVID (Voorhees et al., [2020](https://arxiv.org/html/2605.19568#bib.bib46)). More training details can be found in Appendix A.1.
#### Models
Our proposed models are evaluated in two configurations based on their pretraining:
- m3BERT-s2: This model corresponds to the checkpoint obtained after Stage 1 (English pretraining) and Stage 2 (multilingual adaptation), primarily trained with the multigranular MLM objective.
- m3BERT-s3: This model is the result of Stage 3, which is continually pretrained on the large-scale web domain corpus using the Inf-CL objective, also incorporating the multigranular learning strategy for layers and dimensions.
We compare m3BERT against commonly used baseline embedding models of similar parameter scales to ensure a fair and meaningful comparison. These include:
- mBERT (Kenton and Toutanova, [2019](https://arxiv.org/html/2605.19568#bib.bib1)): We use bert-base-multilingual, a standard multilingual BERT model pretrained with masked language modeling and next sentence prediction objectives.
- mE5 (Wang et al., [2022](https://arxiv.org/html/2605.19568#bib.bib3)): We use multilingual-e5-base, a multilingual version of the E5 model designed for dense retrieval.
- ModernBERT (Warner et al., [2024](https://arxiv.org/html/2605.19568#bib.bib5)): We use ModernBERT-base, incorporating modern architectural improvements, pretrained with MLM and a contrastive learning objective.
We do not include significantly larger models such as bge-m3 (Multi-Granularity, [2024](https://arxiv.org/html/2605.19568#bib.bib19)). To the best of our knowledge, embedding models used in some large-scale commercial scenarios are of “lite” or base size.
To assess performance under stricter resource constraints and to leverage the matryoshka capabilities, we evaluate “lite” versions of all models, which utilize the first 1/3 of their total transformer layers. For embedding dimension, we report results for 32, 64, 128, and 768 dimensions. A notable exception is m3BERT-s3: because its multigranular pretraining in Stage 3 is optimized for matryoshka embeddings up to 128 dimensions, we evaluate it only at 32, 64, and 128 dimensions.
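One way to picture how a “lite”, low-dimensional embedding is read out of a 12-layer model is sketched below. The mean pooling, the L2 normalization, and the function name are assumptions (the pooling method is not specified in this section); only the choice of the first third of the layers and the prefix truncation follow the text.

```python
import math

def lite_embedding(layer_outputs, dim, keep_layers):
    """Embed a sequence from an early layer and a prefix of its dimensions.

    layer_outputs[k] holds the token vectors produced by layer k + 1
    (hypothetical layout); keep_layers=4 corresponds to 1/3 of 12 layers.
    """
    tokens = layer_outputs[keep_layers - 1]          # output of the last retained layer
    pooled = [sum(column) / len(tokens) for column in zip(*tokens)]  # mean pooling (assumed)
    truncated = pooled[:dim]                         # matryoshka prefix
    norm = math.sqrt(sum(v * v for v in truncated))
    return [v / norm for v in truncated] if norm > 0 else truncated
```

In a real deployment the layers after `keep_layers` would simply not be executed, which is where the latency saving comes from; the list of all layer outputs is used here only to keep the example self-contained.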
#### Supervised Finetuning
All models are supervised fine-tuned (SFT) using a contrastive learning objective with in-batch negatives. Given a query $q$, its corresponding positive document $d^{+}$, and a set of $N-1$ negative documents $\{d^{-}_{i}\}_{i=1}^{N-1}$ from the same batch, we first define the scaled similarity score as $s(q,d) = \text{sim}(E(q), E(d)) / \tau$. Here, $E(\cdot)$ is the embedding function (the model), $\text{sim}(\cdot,\cdot)$ is the cosine similarity, and the temperature $\tau$ is set to 0.05. The contrastive loss for the query is then formulated as:

$$\mathcal{L}_{\text{SFT}} = -\log \frac{\exp(s(q,d^{+}))}{\exp(s(q,d^{+})) + \sum_{i=1}^{N-1} \exp(s(q,d^{-}_{i}))} \tag{1}$$

For Bing-Click, we use a batch size of 1024 and a learning rate of $5 \times 10^{-4}$. For MS MARCO, Natural Questions, and TREC-COVID, we use a batch size of 32 and a learning rate of $2 \times 10^{-5}$. The models are trained on 8 NVIDIA A100 GPUs using the AdamW optimizer (Loshchilov and Hutter, [2017](https://arxiv.org/html/2605.19568#bib.bib30)) with fp16 mixed-precision training. Results are averaged over three runs. More training details can be found in Appendix A.1.
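Equation (1) with in-batch negatives can be sketched directly. The function names are illustrative, and the code assumes that row $i$ of the query batch is paired with row $i$ of the document batch, so every other document in the batch serves as a negative; $\tau = 0.05$ follows the text.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_contrastive_loss(query_embs, doc_embs, tau=0.05):
    """Mean of Eq. (1) over the batch; doc_embs[i] is the positive for query i."""
    losses = []
    for i, q in enumerate(query_embs):
        scores = [cosine(q, d) / tau for d in doc_embs]   # s(q, d) for the whole batch
        top = max(scores)
        log_z = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_z - scores[i])                  # -log softmax at the positive
    return sum(losses) / len(losses)
```

With $\tau = 0.05$ a cosine gap of 1.0 between positive and negative becomes a logit gap of 20, so a well-separated batch yields a loss near zero and a fully mismatched one a loss near 20.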
### 3.2. Main Results
We evaluate Recall@100 and Recall@1000 of all models on the Bing-Click test set. The comprehensive results, comparing m3BERT-s2 and m3BERT-s3 against the baselines across various layer configurations (full and lite) and embedding dimensions (32, 64, 128, 768), are presented in Table [1](https://arxiv.org/html/2605.19568#S3.T1).
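For reference, Recall@k over query-document pairs can be computed as below. This sketch assumes one positive document per test pair, so recall reduces to the fraction of pairs whose positive appears in the top k; the function name and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_doc_ids, positive_ids, k):
    """Fraction of queries whose single positive document is in the top k.

    ranked_doc_ids[i] is the retrieval ranking for query i (best first);
    positive_ids[i] is the clicked document for that query.
    """
    hits = sum(1 for ranking, positive in zip(ranked_doc_ids, positive_ids)
               if positive in ranking[:k])
    return hits / len(positive_ids)

# Toy example with three queries and a ranking depth of three.
rankings = [[5, 3, 9], [1, 2, 3], [7, 8, 4]]
positives = [3, 9, 4]
```

On this toy data the positive is found for one query at depth 2 and for two queries at depth 3.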
The results consistently demonstrate the superiority of the m3BERT framework. Across all evaluated configurations, both m3BERT-s2 and m3BERT-s3 significantly outperform the baseline models. This advantage is evident not only when using the full model depth and standard 768 embedding dimension but also extends robustly to the “lite” models (1/3 layers) and lower matryoshka embedding dimensions (32, 64, 128).
Particularly noteworthy is the performance of m3BERT-s3. In our experiments, we observed that m3BERT-s2 had already reached a state of convergence on the general-purpose multilingual corpora from Stage 2, and further pretraining on the same data yielded negligible gains. The continued pretraining on the large-scale web domain corpus (Stage 3) provides a substantial boost in retrieval effectiveness, showcasing the benefit of domain-specific adaptation combined with our multigranular pretraining strategy. Even at very low dimensions like 32 or 64, m3BERT-s3 surpasses or performs competitively with baseline models using much larger embeddings, highlighting the efficiency and representational power instilled by our approach. The consistent gains across the spectrum of model sizes and embedding dimensions underscore the flexibility and strong generalization capabilities of m3BERT, making it a highly effective foundation for industrial applications with varying resource constraints.
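Table 2. Recall@100 comparison of Matryoshka vs. Standard pretraining across datasets and embedding dimensions.

| Dataset | Strategy | D=32 | D=64 | D=128 | D=768 |
| --- | --- | --- | --- | --- | --- |
| Bing-Click | Standard | 57.19 | 62.50 | 64.95 | 65.55 |
| Bing-Click | Matryoshka | 57.53 | 62.79 | 64.62 | 65.44 |
| MS MARCO | Standard | 50.08 | 57.98 | 61.25 | 63.91 |
| MS MARCO | Matryoshka | 51.55 | 58.63 | 61.29 | 63.28 |
| Natural Questions | Standard | 81.17 | 88.36 | 91.56 | 91.80 |
| Natural Questions | Matryoshka | 83.74 | 88.49 | 91.09 | 91.32 |
| TREC-COVID | Standard | 76.07 | 83.58 | 86.92 | 88.07 |
| TREC-COVID | Matryoshka | 79.23 | 84.14 | 86.33 | 87.36 |
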
### 3.3. Impact of Matryoshka Pretraining
To assess the direct benefits of our multigranular matryoshka pretraining strategy, we compare it against standard pretraining approaches where models are individually optimized for specific layer depths and embedding dimensions. We pretrain under two paradigms for 100,000 steps each on the Stage-1 English corpus, with identical hyperparameters and architecture:
- Standard Pretraining: For this paradigm, we train separate models. Each model is pretrained using a standard MLM objective where the loss is computed *only* from a specific target layer (the 4th) and for a specific target dimension (32, 64, 128, or 768). This results in 4 distinct pretrained models.
- Matryoshka Pretraining: Our proposed approach, where a single model is pretrained. The MLM loss is aggregated across multiple selected layers (including the 4th and 12th) and multiple sub-dimensions (32, 64, 128, and 768), as described in Section [2.1](https://arxiv.org/html/2605.19568#S2.SS1).
All pretrained models adopt the m3BERT architecture and are independently fine-tuned (SFT) on each downstream dataset: Bing-Click, MS MARCO, Natural Questions, and TREC-COVID. For all of the standard pretrained models, SFT targets the specific layer and dimension they were pretrained for. For the Matryoshka model, SFT also targets specific layer/dimension combinations for evaluation, but benefits from the joint pretraining.
As shown in Table [2](https://arxiv.org/html/2605.19568#S3.T2), the single matryoshka\-pretrained model achieves performance comparable to its individually pretrained counterparts\. Notably, while performance at the highest dimension shows minor reductions on all four datasets, matryoshka pretraining consistently improves retrieval quality at the lowest dimensions \(32 and 64\)\. This indicates that matryoshka pretraining enhances the flexibility of representation learning, particularly benefiting resource\-constrained deployment scenarios where compact embeddings are crucial\.
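The aggregation over layers and nested dimensions described above can be illustrated with a minimal sketch\. It assumes a shared MLM projection matrix whose first d rows are paired with the first d hidden units, and it averages all \(layer, dimension\) pairs with equal weight\. The equal weighting, the function names, and the toy values are assumptions made for illustration, not the paper's exact formulation from Section 2\.1\.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one target index under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def truncated_logits(hidden, W, d):
    """Vocabulary logits from the first d hidden units and first d rows of W."""
    vocab = len(W[0])
    return [sum(hidden[i] * W[i][v] for i in range(d)) for v in range(vocab)]


def matryoshka_mlm_loss(hidden_by_layer, W, target, layers, dims):
    """Average MLM loss over all (layer, dim) pairs for one masked token.

    hidden_by_layer maps a layer index to that layer's hidden vector.
    Equal weighting of the pairs is an assumption of this sketch.
    """
    losses = [
        softmax_cross_entropy(truncated_logits(hidden_by_layer[layer], W, d), target)
        for layer in layers
        for d in dims
    ]
    return sum(losses) / len(losses)


# Toy setup: hidden size 4, vocabulary of 3 tokens, two supervised layers.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9], [0.2, -0.7, 0.3]]
hidden = {4: [1.0, 0.5, -0.3, 0.2], 12: [0.8, 1.2, -0.1, 0.4]}
loss = matryoshka_mlm_loss(hidden, W, target=1, layers=[4, 12], dims=[2, 4])
print(f"aggregated loss over 4 (layer, dim) pairs: {loss:.4f}")
```

With a single layer and a single dimension the function reduces to the standard per\-model objective, which is the baseline this section compares against\.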
### 3\.4\.Deployment
The model was mainstreamed in 2025 as a core component of the query–keyword online selection workflow, where it consistently handles 25,000 queries per second \(QPS\) under production traffic\. To support this scale, we deployed a large\-scale document index containing over 120 million keywords, enabling coverage across diverse query intents and verticals\. Retrieval is powered by an approximate nearest neighbor \(ANN\) search framework, which efficiently narrows down the search space and returns the top 100 most relevant keywords for each incoming query with low latency\. From a business perspective, this deployment has proven to be highly impactful\. By integrating the model directly into the serving stack, we improved both the efficiency and precision of query–keyword matching, driving measurable gains in downstream ad relevance and monetization\. The system currently contributes an annualized revenue impact \(APR\) of approximately USD 50 million, establishing it as one of the most valuable model deployments in the query understanding pipeline\.
## 4\.Analysis
In this section, we analyze the contributions of different components of $\text{m}^{3}\text{BERT}$ and explore additional matryoshka strategies\. Section [4\.1](https://arxiv.org/html/2605.19568#S4.SS1) shows the impact of various modern architectural components on model performance\. Section [4\.2](https://arxiv.org/html/2605.19568#S4.SS2) explores how leveraging higher\-dimensional representations as teacher signals can enhance the performance of lower\-dimensional embeddings during matryoshka pretraining\. Section [4\.3](https://arxiv.org/html/2605.19568#S4.SS3) demonstrates the advantages of using models pretrained with the matryoshka strategy in conjunction with matryoshka supervised fine\-tuning\.
Table 3\. Matryoshka Supervised Finetuning \(MRL SFT\) performance on Bing\-Click \(Lite models\)\. Parenthesized values denote the performance drop \($\Delta$\) compared to standard single\-dimension finetuning\. Bold indicates the smallest performance drop in each column\.

| Model (Lite) | Dim=32 R@100 | Dim=32 R@1000 | Dim=64 R@100 | Dim=64 R@1000 | Dim=128 R@100 | Dim=128 R@1000 | Dim=768 R@100 | Dim=768 R@1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mBERT | 57.12 (-0.34) | 80.66 (-0.12) | 62.29 (-0.21) | 84.59 (-0.08) | 63.87 (-0.71) | 85.87 (-0.40) | 64.26 (-1.13) | 86.20 (-0.79) |
| mE5 | 56.91 (-0.27) | 80.33 (-0.30) | 61.87 (-0.29) | 84.20 (-0.20) | 63.44 (-0.96) | 85.42 (-0.72) | 63.99 (-1.40) | 85.91 (-0.99) |
| ModernBERT | 56.34 (-0.76) | 80.24 (-0.31) | 62.29 (-0.14) | 84.65 **(+0.06)** | 63.80 (-0.79) | 85.79 (-0.44) | 64.11 (-1.23) | 86.00 (-0.87) |
| $\text{m}^{3}\text{BERT-s2}$ | 58.25 **(-0.09)** | 81.59 **(-0.08)** | 63.52 **(+0.09)** | 85.51 (+0.02) | 65.01 (-0.46) | 86.64 (-0.38) | 65.36 **(-0.97)** | 86.96 **(-0.70)** |
| $\text{m}^{3}\text{BERT-s3}$ | 61.73 (-0.33) | 84.30 (-0.22) | 66.45 (-0.02) | 87.56 (-0.11) | 67.39 **(-0.33)** | 88.19 **(-0.32)** | – | – |

Table 4\. Architecture ablation for $\text{m}^{3}\text{BERT}$ on Bing\-Click\.

| Model Configuration | R@100 | R@1000 |
| --- | --- | --- |
| $\text{m}^{3}\text{BERT}$ | 61.65 | 84.13 |
| – SwiGLU | 60.18 | 82.79 |
| – Pre-norm | 61.42 | 84.01 |
| – RMSNorm | 61.66 | 84.12 |
| + Hidden Dropout | 61.65 | 84.14 |
| + Bias Terms | 61.53 | 84.12 |

### 4\.1\.Architecture Ablation
To quantify the impact of the modern architectural modifications incorporated into $\text{m}^{3}\text{BERT}$, we perform an ablation study\. Starting with our $\text{m}^{3}\text{BERT-s2}$ model in a lite configuration \(1/3 layers\) and using 64\-dimensional embeddings, we systematically revert each enhancement to its counterpart in traditional BERT or E5 models\. Specifically, we evaluate the following variations:
- Replacing SwiGLU with GeLU activation \(w/o SwiGLU\)\.
- Reverting from pre\-normalization to post\-normalization \(w/o Pre\-norm\)\.
- Replacing RMSNorm with standard LayerNorm \(w/o RMSNorm\)\.
- Re\-introducing dropout on hidden states between transformer layers \(w/ Dropout\)\.
- Re\-introducing the bias terms in self\-attention and feed\-forward layers \(w/ Bias\)\.
Each ablated model is trained on the Bing\-Click dataset without pretraining\. The results, presented in Table [4](https://arxiv.org/html/2605.19568#S4.T4), show that while some architectural choices positively impact the final performance, others exhibit negligible effects on accuracy but, as noted by \(Touvron et al\., [2023](https://arxiv.org/html/2605.19568#bib.bib8)\), contribute to computational efficiency or training stability, justifying their inclusion in $\text{m}^{3}\text{BERT}$\.
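For readers unfamiliar with the components being swapped, the following is a minimal sketch of two of the ablated pairs: RMSNorm versus LayerNorm, and a SwiGLU feed\-forward block versus a GeLU one\. Learned gains are omitted and all layers are bias\-free, matching the "no bias terms" default described above; the weight shapes and toy values are assumptions for illustration only\.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale/bias: subtract the mean, then rescale."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """RMSNorm without learned gain: rescale by the root mean square only."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def gelu(v):
    """Exact GeLU via the Gaussian CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def silu(v):
    """SiLU (Swish), the gate activation inside SwiGLU."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gelu_ffn(x, W_in, W_out):
    """Classic feed-forward block: W_out @ gelu(W_in @ x)."""
    return matvec(W_out, [gelu(h) for h in matvec(W_in, x)])


def swiglu_ffn(x, W_gate, W_up, W_out):
    """SwiGLU block: W_out @ (silu(W_gate @ x) * (W_up @ x))."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_out, [g * u for g, u in zip(gate, up)])


# Toy shapes: model width 2, feed-forward width 3.
x = [0.5, -1.0]
W_in = W_gate = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]
W_up = [[0.1, 0.7], [-0.3, 0.2], [0.5, -0.4]]
W_out = [[0.3, -0.2, 0.1], [0.6, 0.4, -0.5]]

print("LayerNorm:", layer_norm(x))
print("RMSNorm:  ", rms_norm(x))
print("GeLU FFN: ", gelu_ffn(x, W_in, W_out))
print("SwiGLU FFN:", swiglu_ffn(x, W_gate, W_up, W_out))
```

RMSNorm skips the mean subtraction that LayerNorm performs, which is one reason it is usually cheaper while, per Table 4, having almost no effect on recall here\.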
### 4\.2\.Matryoshka Distillation
Given the nature of matryoshka pretraining, which simultaneously optimizes embeddings across multiple layers and dimensions, we explore leveraging deeper, higher\-dimensional embeddings as "teacher" signals for shallower, lower\-dimensional embeddings during the pretraining process itself\. This can be viewed as a form of self\-distillation inherent to the matryoshka framework\. To investigate this, we take the $\text{m}^{3}\text{BERT}$ model checkpoint from Stage 1 after 300,000 pretraining steps as our base\. We then continue pretraining for an additional 50,000 steps under several distillation settings\. The distillation objective aims to minimize the KL divergence between the output distributions of the LM heads corresponding to the teacher and student embeddings\.
The specific settings are as follows:
- Continued MRL Pretraining: Continue standard matryoshka pretraining for 50k steps \(baseline for this experiment\)\.
- Distill L12\-D768 to All: The embedding from the 12th layer and 768 dimensions acts as the teacher\. Its LM head output is used to distill knowledge to the LM head outputs of all 11 other combinations of layers \(4th, 8th, 12th\) and dimensions \(32, 64, 128, 768\), excluding the teacher itself\.
- Specific Distillation: Use the LM head output from the 12th layer as the teacher and distill to the 4th layer, 64\-dimension embedding\. We experiment with two variants: \(1\) teacher = L12\-D64, and \(2\) teacher = L12\-D768\.
We evaluate the performance of the 4th layer, 64\-dimension embedding \(L4\-D64\) after these continued pretraining variants on the Bing\-Click dataset\. The results in Table [5](https://arxiv.org/html/2605.19568#S4.T5) suggest that leveraging deeper or higher\-dimensional embeddings as teacher signals during pretraining can indeed enhance the performance of shallower, lower\-dimensional student embeddings, although the gains are small \(at most \+0\.14 R@100 over continued MRL pretraining\)\. However, the "Distill L12\-D768 to All" setting, where a single powerful teacher attempts to guide numerous diverse students simultaneously, does not show a comparable benefit\. This might indicate that overly broad distillation targets could introduce conflicting signals or noise, potentially interfering with the nuanced learning required for each specific layer\-dimension combination\.
Table 5\. Impact of Matryoshka Distillation\.

| Pretraining Strategy | R@100 | R@1000 |
| --- | --- | --- |
| MRL Pretraining (300k steps) | 62.96 | 85.15 |
| *+ 50k additional steps* | | |
| MRL Pretraining | 63.00 | 85.20 |
| Distill L12-D768 to All | 62.98 | 85.20 |
| Distill L12-D64 to L4-D64 | 63.12 | 85.30 |
| Distill L12-D768 to L4-D64 | 63.14 | 85.27 |

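The distillation objective used in this section can be sketched for a single teacher\-student pair as follows\. The sketch assumes the KL direction KL\(teacher ‖ student\), a shared projection matrix truncated to the first d rows, and a temperature of 1\.0; these choices, the function names, and the toy values are illustrative assumptions rather than the paper's confirmed settings\. It also enumerates the student set of the "Distill L12\-D768 to All" setting to confirm the count of 11\.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def truncated_logits(hidden, W, d):
    """Vocabulary logits from the first d hidden units and first d rows of W."""
    vocab = len(W[0])
    return [sum(hidden[i] * W[i][v] for i in range(d)) for v in range(vocab)]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(h_teacher, d_teacher, h_student, d_student, W, tau=1.0):
    """KL between teacher and student token distributions (direction assumed)."""
    p_t = softmax(truncated_logits(h_teacher, W, d_teacher), tau)
    p_s = softmax(truncated_logits(h_student, W, d_student), tau)
    return kl_divergence(p_t, p_s)


# Toy setup: hidden size 4, vocabulary of 3 tokens.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.4, 0.1, 0.9], [0.2, -0.7, 0.3]]
h_deep = [0.8, 1.2, -0.1, 0.4]      # stands in for a 12th-layer hidden state
h_shallow = [1.0, 0.5, -0.3, 0.2]   # stands in for a 4th-layer hidden state
loss = distill_loss(h_deep, 4, h_shallow, 2, W)
print(f"KL(teacher || student) = {loss:.4f}")

# Student set for "Distill L12-D768 to All": every pair except the teacher.
layers, dims = [4, 8, 12], [32, 64, 128, 768]
students = [(layer, d) for layer in layers for d in dims if (layer, d) != (12, 768)]
print(len(students), "student combinations")
```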
### 4\.3\.Matryoshka Finetuning
We also investigate the application of the MRL principle during the supervised fine\-tuning \(SFT\) phase\. Standard SFT typically optimizes a model for a single embedding dimension, which often yields the best possible performance at that specific dimension\. MRL SFT aims to make the model perform well across a set of chosen dimensions \($\mathcal{D}_{\text{SFT}}=\{32,64,128,768\}$ for $\text{m}^{3}\text{BERT-s2}$, and $\mathcal{D}_{\text{SFT}}=\{32,64,128\}$ for $\text{m}^{3}\text{BERT-s3}$\), aggregating SFT loss contributions from these dimensions\. This explicit optimization for multiple granularities can sometimes introduce a slight performance decrease at any single dimension compared to dedicated standard SFT\.
Table [3](https://arxiv.org/html/2605.19568#S4.T3) presents the performance of Lite models fine\-tuned with the MRL SFT strategy, alongside the performance change \(in parentheses\) relative to their standard SFT counterparts \(from Table [1](https://arxiv.org/html/2605.19568#S3.T1), Lite section\)\. While MRL SFT can lead to minor performance drops across all models, our $\text{m}^{3}\text{BERT}$ models, having been pretrained with an MRL objective, generally exhibit smaller performance degradation than the baseline models when subjected to MRL SFT\. This indicates that MRL pretraining effectively prepares the model for subsequent MRL SFT, allowing it to maintain strong, flexible performance across various embedding dimensions with a smaller trade\-off\.
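As a rough illustration of what "aggregating SFT loss contributions" over several dimensions can look like, the sketch below truncates query and document embeddings to each target dimension, re\-normalizes them, and averages a contrastive loss over the dimensions\. The in\-batch InfoNCE loss, the temperature of 0\.05, the equal weighting, and all names and toy values are assumptions for illustration; this section does not specify the exact SFT loss\.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, docs, temperature=0.05):
    """In-batch contrastive loss where docs[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, d)) / temperature for d in docs]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(queries)


def mrl_sft_loss(queries, docs, dims, temperature=0.05):
    """Average the contrastive loss over prefix-truncated embeddings."""
    per_dim = {}
    for d in dims:
        q_d = [l2_normalize(q[:d]) for q in queries]
        k_d = [l2_normalize(k[:d]) for k in docs]
        per_dim[d] = info_nce(q_d, k_d, temperature)
    return sum(per_dim.values()) / len(per_dim), per_dim


# Toy batch: two query/document pairs with 4-dimensional embeddings.
queries = [[1.0, 0.2, 0.1, 0.0], [0.1, 1.0, 0.0, 0.3]]
docs = [[0.9, 0.1, 0.2, 0.1], [0.2, 0.8, 0.1, 0.4]]
total, per_dim = mrl_sft_loss(queries, docs, dims=[2, 4])
print(f"total: {total:.4f}", {d: round(v, 4) for d, v in per_dim.items()})
```

Standard single\-dimension SFT corresponds to calling the same function with a single entry in `dims`, which is the baseline the parenthesized deltas in Table 3 are measured against\.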
## 5\.Related Work
### 5\.1\.Pretrained Embedding Models
The advent of encoder\-only Transformer architectures, exemplified by BERT \(Kenton and Toutanova, [2019](https://arxiv.org/html/2605.19568#bib.bib1)\) and its successors like RoBERTa \(Liu et al\., [2019](https://arxiv.org/html/2605.19568#bib.bib2)\), has significantly advanced information retrieval by providing powerful contextualized text embeddings\. Subsequent specialized models such as Sentence\-BERT \(Reimers and Gurevych, [2019](https://arxiv.org/html/2605.19568#bib.bib21)\), and more recently E5 \(Wang et al\., [2022](https://arxiv.org/html/2605.19568#bib.bib3)\) and ModernBERT \(Warner et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib5)\), have further refined these embeddings for retrieval tasks through techniques like contrastive learning \(Gao et al\., [2021](https://arxiv.org/html/2605.19568#bib.bib22); Reimers and Gurevych, [2019](https://arxiv.org/html/2605.19568#bib.bib21); Lan et al\., [2025](https://arxiv.org/html/2605.19568#bib.bib4)\) and instruction tuning \(Wang et al\., [2022](https://arxiv.org/html/2605.19568#bib.bib3); Wei et al\., [2021](https://arxiv.org/html/2605.19568#bib.bib23)\)\. These models are typically pretrained on vast corpora and then fine\-tuned for specific domains\.
However, a critical limitation of these existing pretrained embedding models is their fixed architecture and output dimension\. This inherent rigidity makes it challenging to efficiently adapt them to diverse downstream applications with varying latency and computational resource constraints, common in commercial systems\. While solutions like task\-specific fine\-tuning of smaller models by partial initialization or post\-hoc compression \(Hinton et al\., [2015](https://arxiv.org/html/2605.19568#bib.bib24); Han et al\., [2015](https://arxiv.org/html/2605.19568#bib.bib25); Jacob et al\., [2018](https://arxiv.org/html/2605.19568#bib.bib26)\) exist, they often fail to fully leverage the extensive knowledge of large pretrained models\.
### 5\.2\.Matryoshka Representation Learning
Matryoshka Representation Learning \(MRL\) \(Kusupati et al\., [2022](https://arxiv.org/html/2605.19568#bib.bib6)\) presents a novel approach to create adaptable representations by training a single model whose embeddings are effective even when truncated to various smaller nested dimensions\. This allows for a direct trade\-off between performance and computational cost without needing to train or store multiple distinct models, and has shown promise in vision \(Wang et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib27)\) and language tasks \(Lai et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib28)\) by enabling flexible deployment\. Beyond output embeddings, the core principle of MRL has also been extended to internal model components \(Devvrit et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib57); Wang et al\., [2025](https://arxiv.org/html/2605.19568#bib.bib59)\), and its versatility is further demonstrated in multimodal learning \(Cai et al\., [2025](https://arxiv.org/html/2605.19568#bib.bib58); Hu et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib60)\) and federated learning for knowledge sharing \(Yi et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib61)\)\.
While MRL offers valuable dimensional flexibility, its application has focused on the output embeddings, often during fine\-tuning stages \(Lai et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib28)\), rather than being integrated into the pretraining of foundation models\. Furthermore, existing MRL work primarily addresses variability in embedding dimension \(Kusupati et al\., [2022](https://arxiv.org/html/2605.19568#bib.bib6)\), without extending this adaptive capability to the model’s architectural depth\. Our work is inspired by MRL’s efficiency but aims to extend its principles by incorporating multigranular optimization across both embedding dimensions and transformer layers directly within the pretraining phase, creating a more fundamentally versatile foundation model\.
### 5\.3\.Efficient Model Deployment
To bridge the gap between large, powerful models and resource\-constrained deployment environments, a variety of techniques have gained prominence\. Knowledge distillation \(Hinton et al\., [2015](https://arxiv.org/html/2605.19568#bib.bib24)\) focuses on transferring complex capabilities from teacher to student models \(Hsieh et al\., [2023](https://arxiv.org/html/2605.19568#bib.bib49); Miao et al\., [2023](https://arxiv.org/html/2605.19568#bib.bib50); Gu et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib51)\)\. Other methods operate directly on a trained network, such as pruning techniques which can remove significant weights in a one\-shot manner \(Frantar and Alistarh, [2023](https://arxiv.org/html/2605.19568#bib.bib52); Sun et al\., [2024](https://arxiv.org/html/2605.19568#bib.bib53)\), and advanced quantization methods that drastically reduce the numerical precision of parameters with minimal performance loss \(Dettmers et al\., [2023](https://arxiv.org/html/2605.19568#bib.bib54); Frantar et al\., [2023](https://arxiv.org/html/2605.19568#bib.bib55)\)\. A different approach involves dynamic networks that adapt computation at inference time through adaptive layer skipping or early exiting based on input confidence \(Schuster et al\., [2022](https://arxiv.org/html/2605.19568#bib.bib56)\)\.
These methods have limitations in the context of flexible, large\-scale deployment\. Knowledge distillation requires separately training and maintaining a student model for each specific performance target, which can be computationally expensive and add significant engineering overhead\. Dynamic methods like early exiting are not directly controllable for deploying a model variant tailored to a specific hardware profile\. Pruning and quantization are largely orthogonal to our work and can be combined with our approach\.
## 6\.Conclusion
We proposed $\text{m}^{3}\text{BERT}$, a Modern, Multilingual, Matryoshka Bidirectional Encoder that leverages a novel matryoshka pretraining strategy to optimize embedding representations across multiple transformer layers and dimensions\. Extensive experiments on the Bing\-Click dataset demonstrate that $\text{m}^{3}\text{BERT}$ consistently outperforms state\-of\-the\-art models across various model sizes and embedding dimensions, effectively balancing resource efficiency and retrieval performance in industrial settings\. The model’s effectiveness is also verified by its successful deployment at Bing\.
## References
- J\. Bai, S\. Bai, Y\. Chu, Z\. Cui, K\. Dang, X\. Deng, Y\. Fan, W\. Ge, Y\. Han, F\. Huang,et al\.\(2023\)Qwen technical report\.arXiv preprint arXiv:2309\.16609\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px2.p1.1)\.
- P\. Bajaj, D\. Campos, N\. Craswell, L\. Deng, J\. Gao, X\. Liu, R\. Majumder, A\. McNamara, B\. Mitra, T\. Nguyen, M\. Rosenberg, X\. Song, A\. Stoica, S\. Tiwary, and T\. Wang \(2018\)MS marco: a human generated machine reading comprehension dataset\.External Links:1611\.09268Cited by:[§3\.1](https://arxiv.org/html/2605.19568#S3.SS1.SSS0.Px1.p1.1)\.
- M\. Cai, J\. Yang, J\. Gao, and Y\. J\. Lee \(2025\)Matryoshka multimodal models\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1)\.
- Z\. Cheng, H\. Zhang, K\. Li, S\. Leng, Z\. Hu, F\. Wu, D\. Zhao, X\. Li, and L\. Bing \(2024\)Breaking the memory barrier: near infinite batch size scaling for contrastive loss\.arXiv preprint arXiv:2410\.17243\.Cited by:[§2\.3](https://arxiv.org/html/2605.19568#S2.SS3.SSS0.Px3.p1.1)\.
- T\. Dao, D\. Fu, S\. Ermon, A\. Rudra, and C\. Ré \(2022\)Flashattention: fast and memory\-efficient exact attention with io\-awareness\.Advances in neural information processing systems35,pp\. 16344–16359\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px4.p1.1)\.
- T\. Dao \(2023\)Flashattention\-2: faster attention with better parallelism and work partitioning\.arXiv preprint arXiv:2307\.08691\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px4.p1.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\)QLoRA: efficient finetuning of quantized llms\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 10088–10115\.Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- F\. Devvrit, S\. Kudugunta, A\. Kusupati, T\. Dettmers, K\. Chen, I\. Dhillon, Y\. Tsvetkov, H\. Hajishirzi, S\. Kakade, A\. Farhadi,et al\.\(2024\)Matformer: nested transformer for elastic inference\.Advances in Neural Information Processing Systems\.Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1)\.
- E\. Frantar and D\. Alistarh \(2023\)SparseGPT: massive language models can be accurately pruned in one\-shot\.InInternational Conference on Machine Learning, ICML 2023,Proceedings of Machine Learning Research, Vol\.202,pp\. 10323–10337\.Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2023\)GPTQ: accurate post\-training quantization for generative pre\-trained transformers\.InInternational Conference on Learning Representations,Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- L\. Gao, Y\. Wang, M\. Peng, J\. Tang, Y\. Shang, M\. Sun, and J\. Su \(2025\)Tool graph retriever: exploring dependency graph\-based tool retrieval for large language models\.arXiv preprint arXiv:2508\.05152\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p1.1)\.
- T\. Gao, X\. Yao, and D\. Chen \(2021\)SimCSE: simple contrastive learning of sentence embeddings\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- D\. Groeneveld, I\. Beltagy, P\. Walsh, A\. Bhagia, R\. Kinney, O\. Tafjord, A\. H\. Jha, H\. Ivison, I\. Magnusson, Y\. Wang,et al\.\(2024\)Olmo: accelerating the science of language models\.arXiv preprint arXiv:2402\.00838\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px1.p1.1)\.
- Y\. Gu, L\. Dong, F\. Wei, and M\. Huang \(2024\)MiniLLM: knowledge distillation of large language models\.InInternational Conference on Learning Representations,Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- S\. Han, H\. Mao, and W\. J\. Dally \(2015\)Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding\.arXiv preprint arXiv:1510\.00149\.Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p2.1)\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.arXiv preprint arXiv:1503\.02531\.Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p2.1),[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- C\. Hsieh, C\. Li, C\. Yeh, H\. Nakhost, Y\. Fujii, A\. Ratner, R\. Krishna, C\. Lee, and T\. Pfister \(2023\)Distilling step\-by\-step\! outperforming larger language models with less training data and smaller model sizes\.InFindings of the Association for Computational Linguistics: ACL 2023,Toronto, Canada,pp\. 8003–8017\.Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- W\. Hu, Z\. Dou, L\. Li, A\. Kamath, N\. Peng, and K\. Chang \(2024\)Matryoshka query transformer for large vision\-language models\.Advances in Neural Information Processing Systems\.Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1)\.
- B\. Jacob, S\. Kligys, B\. Chen, M\. Zhu, M\. Tang, A\. Howard, H\. Adam, and D\. Kalenichenko \(2018\)Quantization and training of neural networks for efficient integer\-arithmetic\-only inference\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 2704–2713\.Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p2.1)\.
- G\. Jung, M\. A\. Hiltunen, K\. R\. Joshi, R\. D\. Schlichting, and C\. Pu \(2010\)Mistral: dynamically managing power, performance, and adaptation cost in cloud infrastructures\.In2010 IEEE 30th International Conference on Distributed Computing Systems,pp\. 62–73\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p5.2)\.
- J\. D\. M\. C\. Kenton and L\. K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of naacL\-HLT,Vol\.1\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.19568#S2.SS3.SSS0.Px2.p1.4),[1st item](https://arxiv.org/html/2605.19568#S3.I2.i1.p1.1.1),[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- A\. Kusupati, G\. Bhatt, A\. Rege, M\. Wallingford, A\. Sinha, V\. Ramanujan, W\. Howard\-Snyder, K\. Chen, S\. Kakade, P\. Jain,et al\.\(2022\)Matryoshka representation learning\.Advances in Neural Information Processing Systems35,pp\. 30233–30249\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p4.1),[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p2.1)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov \(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics7\.Cited by:[§3\.1](https://arxiv.org/html/2605.19568#S3.SS1.SSS0.Px1.p1.1)\.
- R\. Lai, L\. Chen, W\. Chen, and R\. Chen \(2024\)Matryoshka representation learning for recommendation\.arXiv preprint arXiv:2406\.07432\.Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1),[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p2.1)\.
- Z\. Lan, L\. Niu, F\. Meng, J\. Zhou, and J\. Su \(2025\)Llave: large language and vision embedding models with hardness\-weighted contrastive learning\.arXiv preprint arXiv:2503\.04812\.Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)Roberta: a robustly optimized bert pretraining approach\.arXiv preprint arXiv:1907\.11692\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p2.1),[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§2\.3](https://arxiv.org/html/2605.19568#S2.SS3.SSS0.Px4.p1.9),[§3\.1](https://arxiv.org/html/2605.19568#S3.SS1.SSS0.Px3.p2.2)\.
- Z\. Miao, W\. Zhang, J\. Su, X\. Li, J\. Luan, Y\. Chen, B\. Wang, and M\. Zhang \(2023\)Exploring all\-in\-one knowledge distillation framework for neural machine translation\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 2929–2940\.Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- M\. M\. Multi\-Granularity \(2024\)M3\-embedding: multi\-linguality, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p5.2),[§3\.1](https://arxiv.org/html/2605.19568#S3.SS1.SSS0.Px2.p2.2)\.
- OrdalieTech\. Solon embeddings large 0\.1\. Note: [https://huggingface\.co/OrdalieTech/Solon\-embeddings\-large\-0\.1](https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1) Cited by: [§1](https://arxiv.org/html/2605.19568#S1.p5.2)\.
- R\. Patil, S\. Boit, V\. Gudivada, and J\. Nandigam \(2023\)A survey of text representation and embedding techniques in nlp\.IEEE Access11,pp\. 36120–36146\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p1.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- T\. Schuster, A\. Fisch, J\. Gupta, M\. Dehghani, D\. Bahri, V\. Tran, Y\. Tay, and D\. Metzler \(2022\)Confident adaptive language modeling\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 17456–17472\.Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- N\. Shazeer \(2020\)Glu variants improve transformer\.arXiv preprint arXiv:2002\.05202\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px1.p1.1)\.
- D\. Su, K\. Kong, Y\. Lin, J\. Jennings, B\. Norick, M\. Kliegl, M\. Patwary, M\. Shoeybi, and B\. Catanzaro \(2024\)Nemotron\-cc: transforming common crawl into a refined long\-horizon pretraining dataset\.arXiv preprint arXiv:2412\.02595\.Cited by:[§2\.3](https://arxiv.org/html/2605.19568#S2.SS3.SSS0.Px1.p1.2)\.
- M\. Sun, Z\. Liu, A\. Bair, and Z\. Kolter \(2024\)A simple and effective pruning approach for large language models\.InInternational Conference on Representation Learning,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),pp\. 4942–4964\.Cited by:[§5\.3](https://arxiv.org/html/2605.19568#S5.SS3.p1.1)\.
- G\. Team, T\. Mesnard, C\. Hardin, R\. Dadashi, S\. Bhupatiraju, S\. Pathak, L\. Sifre, M\. Rivière, M\. S\. Kale, J\. Love, et al\. \(2024\) Gemma: open models based on gemini research and technology\. arXiv preprint arXiv:2403\.08295\. Cited by: [§1](https://arxiv.org/html/2605.19568#S1.p5.2)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p5.2),[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px2.p1.1),[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.19568#S4.SS1.p1.3)\.
- E\. Voorhees, T\. Alam, S\. Bedrick, D\. Demner\-Fushman, W\. R\. Hersh, K\. Lo, K\. Roberts, I\. Soboroff, and L\. L\. Wang \(2020\)TREC\-covid: constructing a pandemic information retrieval test collection\.External Links:2005\.04474Cited by:[§3\.1](https://arxiv.org/html/2605.19568#S3.SS1.SSS0.Px1.p1.1)\.
- L\. Wang, N\. Yang, X\. Huang, B\. Jiao, L\. Yang, D\. Jiang, R\. Majumder, and F\. Wei \(2022\)Text embeddings by weakly\-supervised contrastive pre\-training\.arXiv preprint arXiv:2212\.03533\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p2.1),[§1](https://arxiv.org/html/2605.19568#S1.p5.2),[2nd item](https://arxiv.org/html/2605.19568#S3.I2.i2.p1.1.1),[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- Q\. Wang, B\. Li, T\. Xiao, J\. Zhu, C\. Li, D\. F\. Wong, and L\. S\. Chao \(2019\)Learning deep transformer models for machine translation\.arXiv preprint arXiv:1906\.01787\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px2.p1.1)\.
- Y\. Wang, Q\. Hu, Y\. Ding, R\. Wang, Y\. Gong, J\. Jiao, Y\. Shen, P\. Cheng, and J\. Su \(2025\)Training matryoshka mixture\-of\-experts for elastic inference\-time expert utilization\.arXiv preprint arXiv:2509\.26520\.Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1)\.
- Y\. Wang, Z\. Yue, H\. Zeng, D\. Wang, and J\. McAuley \(2024\)Train once, deploy anywhere: matryoshka representation learning for multimodal recommendation\.arXiv preprint arXiv:2409\.16627\.Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1)\.
- B\. Warner, A\. Chaffin, B\. Clavié, O\. Weller, O\. Hallström, S\. Taghadouini, A\. Gallagher, R\. Biswas, F\. Ladhak, T\. Aarsen, N\. Cooper, G\. Adams, J\. Howard, and I\. Poli \(2024\)Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference\.External Links:2412\.13663Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px3.p1.1),[3rd item](https://arxiv.org/html/2605.19568#S3.I2.i3.p1.1),[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- J\. Wei, M\. Bosma, V\. Y\. Zhao, K\. Guu, A\. W\. Yu, B\. Lester, N\. Du, A\. M\. Dai, and Q\. V\. Le \(2021\)Finetuned language models are zero\-shot learners\.arXiv preprint arXiv:2109\.01652\.Cited by:[§5\.1](https://arxiv.org/html/2605.19568#S5.SS1.p1.1)\.
- A\. Wettig, T\. Gao, Z\. Zhong, and D\. Chen \(2022\)Should you mask 15% in masked language modeling?\.arXiv preprint arXiv:2202\.08005\.Cited by:[§2\.3](https://arxiv.org/html/2605.19568#S2.SS3.SSS0.Px1.p1.2)\.
- Z\. Xiang, C\. Wu, Q\. Zhang, S\. Chen, Z\. Hong, X\. Huang, and J\. Su \(2025\)When to use graphs in rag: a comprehensive analysis for graph retrieval\-augmented generation\.arXiv preprint arXiv:2506\.05690\.Cited by:[§1](https://arxiv.org/html/2605.19568#S1.p1.1)\.
- L\. Yi, H\. Yu, C\. Ren, G\. Wang, X\. Li,et al\.\(2024\)Federated model heterogeneous matryoshka representation learning\.Advances in Neural Information Processing Systems\.Cited by:[§5\.2](https://arxiv.org/html/2605.19568#S5.SS2.p1.1)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.Advances in Neural Information Processing Systems32\.Cited by:[§2\.2](https://arxiv.org/html/2605.19568#S2.SS2.SSS0.Px2.p1.1)\.
## Appendix AImplementation Details
### A\.1\.Evaluation
As mentioned in Section 3\.1, we conduct experiments on four benchmark datasets: Bing\-Click, MS MARCO Document Ranking, Natural Questions, and TREC\-COVID\. All datasets are formatted as query\-document pairs to align with our model’s input requirements\. Specifically, for the Natural Questions dataset, we use question\-answer pairs where the answer span is treated as the relevant document\. For TREC\-COVID, we adopt the title\-text pairs, where the title represents the query and the document text serves as the candidate passage\.
Our evaluation pipeline uses the sentence\-transformers library to generate text embeddings from the fine\-tuned models\. For the retrieval tasks, we employ Faiss, a library for efficient similarity search\. Specifically, we perform an exhaustive \(exact\) search using inner product similarity, which equals cosine similarity on L2\-normalized embeddings\. This is typically achieved using Faiss indices like IndexFlatIP after ensuring all embeddings are L2\-normalized\. No approximate nearest neighbor \(ANN\) search methods were used for the reported results, to ensure precise recall figures\.
The Recall@K metric is computed as follows. For each query $q$ in the test set and its corresponding ground-truth positive document $d^{+} \in \mathcal{D}_{\text{pool}}$ (where $\mathcal{D}_{\text{pool}}$ is the 10 million document candidate pool):
1. Obtain the query embedding $E(q)$ and the embeddings for all documents $d_i \in \mathcal{D}_{\text{pool}}$, denoted $E(d_i)$.
2. Calculate the similarity score (cosine similarity) between $E(q)$ and every $E(d_i)$.
3. Rank all documents in $\mathcal{D}_{\text{pool}}$ by their similarity scores with $E(q)$ in descending order.
4. If the ground-truth positive document $d^{+}$ is found within the top K ranked documents, the query is counted as a "hit" for Recall@K.
5. The final Recall@K is the proportion of test queries for which $d^{+}$ was retrieved within the top K results.
We report Recall@100 and Recall@1000 in our experiments\.
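The procedure above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the paper: it replaces the Faiss `IndexFlatIP` search with a plain NumPy matrix product (which computes the same exact inner products), and the function names `l2_normalize` and `recall_at_k` as well as the toy embeddings are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

def recall_at_k(query_emb, doc_emb, positive_idx, k):
    """Fraction of queries whose ground-truth document is among the top-k results."""
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    d = l2_normalize(np.asarray(doc_emb, dtype=float))
    scores = q @ d.T                       # exhaustive (exact) inner-product search
    k = min(k, d.shape[0])
    # Indices of the k highest-scoring documents per query; order inside the top-k
    # does not matter for Recall@K, so a partial sort is sufficient.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (top_k == np.asarray(positive_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example (hypothetical data): 4 documents, 3 queries.
doc_emb = np.eye(4)
query_emb = np.array([[0.9, 0.1, 0.0, 0.0],
                      [0.0, 0.2, 0.8, 0.0],
                      [0.1, 0.0, 0.0, 0.7]])
positive_idx = np.array([0, 2, 0])

r1 = recall_at_k(query_emb, doc_emb, positive_idx, k=1)
r2 = recall_at_k(query_emb, doc_emb, positive_idx, k=2)
```

For a pool of 10 million documents the full score matrix would not fit in memory, so a real pipeline would search in batches or rely on the Faiss index directly; the metric itself is unchanged.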
### A.2. Matryoshka Distillation
As explored in Section 4.3.3, Matryoshka distillation aims to leverage deeper or higher-dimensional embeddings as teacher signals for shallower or lower-dimensional student embeddings during pretraining. The distillation objective minimizes the Kullback-Leibler (KL) divergence between the predicted token distributions from the LM heads of the teacher and student embeddings. Given a teacher embedding output $h_T$ (e.g., from layer $l_T$, dimension $d_T$) and a student embedding output $h_S$ (e.g., from layer $l_S$, dimension $d_S$), their respective predicted token distributions over the vocabulary $V$ are $P_T = \text{Softmax}(h_T W_{[:d_T,:]} / \tau_D)$ and $P_S = \text{Softmax}(h_S W_{[:d_S,:]} / \tau_D)$, where $W$ is the shared MLM projection matrix and $\tau_D$ is the distillation temperature. The distillation loss for a single teacher-student pair is:
$$\mathcal{L}_{\text{distill}}(P_S \| P_T) = \sum_{v \in V} P_S(v) \log \frac{P_S(v)}{P_T(v)} \tag{2}$$

When multiple student (or teacher) embeddings are involved, such as in the "Distill L12-D768 to All" setting, the individual KL divergence losses are summed. This distillation loss is then added to the primary Matryoshka pretraining loss (e.g., the multigranular MLM loss $\mathcal{L}_{\text{MRL}}$):
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MRL}} + \lambda_D \sum_{(S,T) \in \mathcal{P}_{\text{distill}}} \mathcal{L}_{\text{distill}}(P_S \| P_T) \tag{3}$$

where $\mathcal{P}_{\text{distill}}$ is the set of chosen teacher-student pairs and $\lambda_D$ is a weighting factor for the distillation loss (set to $1.0$ in our experiments).
We observed that initiating Matryoshka Distillation from the very beginning of pretraining can lead to instability. This is likely because the initial representations, especially from deeper layers and higher dimensions, are noisy and not yet effective teachers. Therefore, in our experiments reported in Table [5](https://arxiv.org/html/2605.19568#S4.T5), we first pretrained the $\text{m}^{3}\text{BERT}$ model for 300,000 steps using standard Matryoshka pretraining (as in Stage 1) and then continued pretraining for an additional 50,000 steps with the distillation objective incorporated. The distillation temperature $\tau_D$ was set to $1.0$. Other hyperparameters (learning rate, batch size, etc.) for these 50,000 steps remained consistent with the Stage 1 pretraining configuration. This approach shows promise for enhancing the quality of more compact representations within the Matryoshka framework.
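A minimal NumPy sketch of Equations (2) and (3) may help make the slicing of the shared projection matrix concrete. It is an illustration under stated assumptions rather than the training code: the hidden states are random, the MRL loss value is a placeholder, the KL term is averaged over token positions (the equations above define it for a single position), and the helper names `token_distribution`, `kl_divergence`, and `total_loss` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_distribution(h, W, d, tau=1.0):
    """P = Softmax(h[:d] W[:d, :] / tau), using the first d dims of h and rows of W."""
    return softmax(h[..., :d] @ W[:d, :] / tau)

def kl_divergence(p_s, p_t, eps=1e-12):
    """KL(P_S || P_T) as in Eq. (2), averaged over token positions (an assumption)."""
    per_token = np.sum(p_s * (np.log(p_s + eps) - np.log(p_t + eps)), axis=-1)
    return float(per_token.mean())

def total_loss(l_mrl, pairs, lambda_d=1.0):
    """Eq. (3): MRL loss plus the weighted sum of KL terms over (student, teacher) pairs."""
    return l_mrl + lambda_d * sum(kl_divergence(p_s, p_t) for p_s, p_t in pairs)

# Toy setup (hypothetical sizes): hidden size 8, vocabulary 16, 5 masked tokens.
rng = np.random.default_rng(0)
D, V, n_tokens = 8, 16, 5
W = rng.normal(size=(D, V))                  # shared MLM projection matrix
h_teacher = rng.normal(size=(n_tokens, D))   # e.g. final layer, full dimension
h_student = rng.normal(size=(n_tokens, D))   # e.g. a shallower layer

p_t = token_distribution(h_teacher, W, d=8, tau=1.0)
p_s = token_distribution(h_student, W, d=4, tau=1.0)
loss = total_loss(l_mrl=2.0, pairs=[(p_s, p_t)], lambda_d=1.0)
```

In an actual training loop the teacher distribution would typically be treated as a constant (no gradient), but the text above does not state whether this was done.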
### A.3. Matryoshka Finetuning
For Matryoshka Finetuning, as discussed in Section 4.3.4, we adapt the standard supervised fine-tuning (SFT) objective to simultaneously optimize the model for multiple embedding dimensions. Given a set of target embedding dimensions $\mathcal{D}_{\text{SFT}}$ (e.g., $\{32, 64, 128, 768\}$ for $\text{m}^{3}\text{BERT-s2}$ and $\{32, 64, 128\}$ for $\text{m}^{3}\text{BERT-s3}$), the MRL SFT loss is the sum of the standard SFT losses computed for each dimension in $\mathcal{D}_{\text{SFT}}$:
$$\mathcal{L}_{\text{MRL-SFT}} = \sum_{d \in \mathcal{D}_{\text{SFT}}} \mathcal{L}_{\text{SFT}}(E_{[:d]}) \tag{4}$$

where $E_{[:d]}$ denotes the model's output embedding truncated to the first $d$ dimensions, and $\mathcal{L}_{\text{SFT}}(E_{[:d]})$ is the contrastive loss calculated using these $d$-dimensional embeddings. The individual SFT losses for different dimensions are directly summed. All other hyperparameters for MRL SFT (e.g., learning rate, batch size, temperature $\tau$ for the contrastive loss) were kept identical to those used for the standard single-dimension SFT experiments reported in the main results.
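Equation (4) can be illustrated with a short NumPy sketch. The appendix only says that $\mathcal{L}_{\text{SFT}}$ is a contrastive loss, so the specific form used here (in-batch negatives, cosine similarity, temperature `tau=0.05`) is an assumption, as are the function names `info_nce` and `mrl_sft_loss` and the random embeddings.

```python
import numpy as np

def info_nce(q, d, tau=0.05):
    """In-batch contrastive loss: row i of q is the positive for row i of d."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = q @ d.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def mrl_sft_loss(q_emb, d_emb, dims, tau=0.05):
    """Eq. (4): unweighted sum of the contrastive loss over truncated embeddings."""
    return sum(info_nce(q_emb[:, :d], d_emb[:, :d], tau) for d in dims)

# Toy batch (hypothetical): 4 query-document pairs with 768-dim embeddings.
rng = np.random.default_rng(0)
q_emb = rng.normal(size=(4, 768))
d_emb = q_emb + 0.5 * rng.normal(size=(4, 768))   # noisy positives
dims = (32, 64, 128, 768)

loss = mrl_sft_loss(q_emb, d_emb, dims)
per_dim = {d: info_nce(q_emb[:, :d], d_emb[:, :d]) for d in dims}
```

Each truncated embedding is re-normalized before scoring, so every prefix is trained to be usable on its own at inference time.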
## Appendix B Pretraining Data
For all pretraining stages, input sequences were tokenized with the mE5 tokenizer, padding shorter sequences and truncating longer ones to a maximum length of 1024 tokens.
#### Nemotron\-CC \(Stage 1\)
The English pretraining corpus is derived from the high\-quality split of Nemotron\-CC, which is itself based on CommonCrawl \(CC\) and refined through extensive filtering and deduplication\. We utilized a subset of approximately 100 billion tokens from the high\-quality split\.
Table 6. Smoothed sampling probabilities for the top 20 languages (Stage 2 multilingual Wikipedia corpus).

| Code | Language Name | Prob. (%) | Code | Language Name | Prob. (%) |
| --- | --- | --- | --- | --- | --- |
| en | English | 20.99 | ceb | Cebuano | 20.33 |
| de | German | 11.89 | sv | Swedish | 11.08 |
| fr | French | 11.05 | nl | Dutch | 9.73 |
| ru | Russian | 9.11 | es | Spanish | 8.77 |
| it | Italian | 8.74 | arz | Egyptian Arabic | 8.02 |
| pl | Polish | 7.90 | ja | Japanese | 7.20 |
| zh | Chinese | 7.18 | uk | Ukrainian | 6.85 |
| vi | Vietnamese | 6.83 | war | Waray-Waray | 6.75 |
| ar | Arabic | 6.57 | pt | Portuguese | 6.16 |
| fa | Persian | 5.64 | ca | Catalan | 4.62 |
#### Multilingual Wikipedia \(Stage 2\)
For multilingual adaptation, we used publicly available Wikipedia dumps ([https://meta.wikimedia.org/wiki/List_of_Wikipedias](https://meta.wikimedia.org/wiki/List_of_Wikipedias)) for the top 100 languages by article count. The selection of languages and initial processing followed mBERT. The raw data from these dumps, totaling approximately 20 billion tokens, was further processed. As mentioned in the main text, an exponential smoothing strategy ($P'(L) \propto P(L)^{S}$ with $S = 0.7$) was applied to upsample low-resource languages and downsample high-resource languages. This creates a more balanced multilingual training mixture, preventing high-resource languages from dominating the training process. Table [6](https://arxiv.org/html/2605.19568#A2.T6) illustrates the smoothed sampling probabilities for the top 20 languages in our multilingual corpus after applying this strategy.
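The effect of the smoothing exponent can be seen in a small sketch. The token counts below are invented for illustration, the function name `smoothed_sampling_probs` is hypothetical, and renormalizing the smoothed values so they sum to one is an assumption about how the proportionality is resolved.

```python
import numpy as np

def smoothed_sampling_probs(token_counts, s=0.7):
    """Return P'(L) proportional to P(L)**s, renormalized to sum to 1."""
    langs = list(token_counts.keys())
    counts = np.asarray([token_counts[lang] for lang in langs], dtype=float)
    p = counts / counts.sum()          # raw share of each language
    p_smooth = p ** s                  # s < 1 flattens the distribution
    p_smooth = p_smooth / p_smooth.sum()
    return dict(zip(langs, p_smooth))

# Hypothetical token counts (in millions) for three languages.
counts = {"en": 1000, "de": 300, "sw": 10}
raw = {lang: c / sum(counts.values()) for lang, c in counts.items()}
smoothed = smoothed_sampling_probs(counts, s=0.7)
```

With $S < 1$ the ranking of languages is preserved, but the largest language loses probability mass and the smallest gains it; $S = 1$ recovers the raw proportions.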
#### Web Domain Corpus \(Stage 3\)
The corpus for Stage 3 continual pretraining, designed to specialize the model for ads and web search\-related industrial retrieval tasks, consists of 10 billion query\-item pairs derived from anonymized and aggregated user interaction logs within an advertising system\. These pairs represent implicit positive associations between user queries and relevant advertising content\. We constructed three primary types of query\-item pairs:
- **(Query, Keyword):** Pairs linking user search queries to advertiser-defined keywords that triggered ad impressions.
- **(Query, Ad Creative):** Pairs linking user search queries to the textual content of advertisements they interacted with. The "Ad Creative" item concatenates the ad's title and descriptive text.
- **(Query, Landing Page Information):** Pairs linking user search queries to textual information associated with the ad's destination. The "Landing Page Information" item is a concatenation of text extracted from resources related to the ad's landing URL, such as its associated page title, prominent headings, and representative content snippets.

Compared to the carefully curated and processed Bing-Click evaluation data, this pretraining corpus for Stage 3 comprises a significantly larger volume of text that is more diverse in form, contains more noise, and is generally of lower raw quality, reflecting the scale and nature of real-world advertising interaction data.
These different pair types capture diverse semantic relationships relevant to advertising retrieval\. In the underlying advertising ecosystem, a single keyword can be associated with multiple ad creatives, and an ad creative can point to various landing page information items \(and vice versa\)\. This rich interconnectedness provides a diverse training signal\.
The corpus was rigorously pre-processed, including deduplication of identical query-item pairs and query-level sampling caps to mitigate frequency bias. The model was trained on this corpus for approximately 2 trillion tokens using Infinite Contrastive Learning (Inf-CL). The input for the "item" side of the pair was formed by concatenating the relevant textual fields (e.g., ad title + ad description for "Ad Creative") using special separator tokens, then tokenized and truncated or padded to the model's maximum input length. This domain-specific pretraining significantly enhances the model's utility for commercial retrieval applications.
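The two pre-processing steps named above, removing identical pairs and capping the number of pairs per query, can be sketched as follows. The cap value, the random sampling used to enforce it, and the function name `dedup_and_cap` are assumptions for illustration; the actual thresholds and sampling scheme are not given here.

```python
import random
from collections import defaultdict

def dedup_and_cap(pairs, max_per_query, seed=0):
    """Drop exact duplicate (query, item) pairs, then keep at most max_per_query items per query."""
    unique = list(dict.fromkeys(pairs))        # removes duplicates, keeps first-seen order
    by_query = defaultdict(list)
    for query, item in unique:
        by_query[query].append(item)

    rng = random.Random(seed)                  # fixed seed for reproducible sampling
    result = []
    for query, items in by_query.items():
        if len(items) > max_per_query:
            items = rng.sample(items, max_per_query)   # cap very frequent queries
        result.extend((query, item) for item in items)
    return result

# Hypothetical log sample: "shoes" is a frequent query with a repeated pair.
pairs = [
    ("shoes", "running shoes sale"),
    ("shoes", "running shoes sale"),
    ("shoes", "leather boots"),
    ("shoes", "kids sneakers"),
    ("tv", "4k television deals"),
]
cleaned = dedup_and_cap(pairs, max_per_query=2)
```

Capping per query keeps a handful of very frequent queries from dominating the contrastive batches, which is the frequency bias the text refers to.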
## Appendix C Use of Large Language Models
In preparing this manuscript, an LLM was used for language polishing, such as rephrasing sentences for clarity and correcting grammar. The human authors critically reviewed and edited all LLM-generated outputs, and retain full responsibility for the final content, methodology, and conclusions of this work.