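@vintcessun: It turns out that LLM text embeddings are hijacked by high-frequency tokens (periods, articles)! The unembedding matrix implicitly defines a low-rank subspace that dominates these uninformative expressions. This is the root cause of LLMs performing poorly as general-purpose embedding models, and the contamination is hard to spot. EmbedFilter…
Summary
This study shows that LLM text embeddings are hijacked by high-frequency tokens (such as periods and articles) and proposes EmbedFilter, which applies SVD to the unembedding matrix and subtracts the projected component to release the true semantics, achieving dimensionality reduction and improved retrieval efficiency with zero training overhead.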
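It turns out that LLM text embeddings are hijacked by high-frequency tokens (periods, articles)! The unembedding matrix implicitly defines a low-rank subspace that dominates these uninformative expressions. This is the root cause of LLMs performing poorly as general-purpose embedding models, and the contamination is hard to spot. EmbedFilter: apply SVD to the unembedding matrix, take the top-k singular vectors to form a subspace, and subtract the projection onto that subspace from the embedding. A single linear transformation releases the true semantics, reduces dimensionality naturally with zero training overhead, and doubles retrieval index efficiency.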
Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings
Source: https://arxiv.org/html/2606.07502 (2018)
Abstract.
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into the embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
Zero-shot Text Embedding, Large Language Model, Mechanistic Interpretation
1.Introduction
Large language models (LLMs) have made significant strides in recent years, demonstrating impressive performance across a wide range of tasks (DeepSeek-AI, 2026; Grattafiori et al., 2024; Team, 2024). The emergence of zero-shot learning ability helps LLMs address unseen tasks effectively without any additional fine-tuning (Kaplan et al., 2020). However, recent studies highlight a persistent performance gap of LLMs when deployed as zero-shot text embedding models (Jiang et al., 2024; Li and Zhou, 2025; BehnamGhader et al., 2024). This deficiency hinders their adoption for text embedding tasks and raises concerns regarding their full efficacy as generalist models in real-world applications.
To bridge this gap, researchers have explored various ways to better elicit semantic information from LLMs. Prompt-engineering methods have been proposed to help extract text embeddings directly from LLMs (Jiang et al., 2024; Springer et al., 2025; Lei et al., 2024; Thirukovalluru and Dhingra, 2025). These approaches are well motivated; however, their improvements are modest and highly sensitive to the choice of the prompt, leading to inconsistent performance across different setups. Existing approaches are primarily heuristic and fail to resolve the bottleneck that limits LLMs' ability to capture semantics. In this paper, we move beyond previous heuristic efforts and seek to provide a mechanistic interpretation for LLMs' suboptimal performance in text embedding tasks. Specifically, we identify an unexpected representation collapse: when projected onto the vocabulary space, raw text embeddings from LLMs tend to align with high-frequency tokens that are semantically irrelevant. Equipped with the Logit Lens tool (Belrose et al., 2023), we find that frequent but uninformative tokens disproportionately dominate the highest decoding probabilities of these text embeddings. This suggests that these hidden representations are biased toward common vocabulary tokens, regardless of the input semantics (readers unfamiliar with Logit Lens may refer to Section 2 for further details). As shown in Figure 1, this phenomenon is observed across different language model families, indicating a universal pattern inherent to LLMs.
Figure 1. Logit Lens applied to text embeddings from three LLM backbones. Word clouds show the top-aligned tokens with the highest decoding probabilities, which are primarily high-frequency yet semantically uninformative. The input text, encoded by the text embeddings, is given as: "We call this a 'lens' because it is one way of extracting information from GPT's internal activations. I imagine there is other information present in the activations that cannot be understood by looking at logits over tokens. The logit lens show us some of what is going on, not all of it." This corresponds to the official notation of the logit lens.
We extend our analysis to uncover the underlying drivers of this representation collapse. Prior studies (Li et al., 2020; Ethayarajh, 2019) have established that text embeddings are anisotropic: they are confined to a narrow cone rather than being uniformly distributed in the embedding space. We hypothesize that the centroid of this narrow region corresponds to an "average" token, which Lv et al. (2024) describe as the frequency-weighted average embedding over the training corpus. This perspective provides a mechanistic rationale for the atypical patterns observed in Logit Lens analyses. Raw embeddings from LLMs are pulled toward this commonality region, overshadowing their unique semantic features. By suppressing the contribution of these "average" components, we can mitigate the anisotropy problem and unmask the true semantic representations within LLMs.
We seek to pinpoint the hidden contributor that steers text embeddings towards the "average" token representation. To this end, we apply Logit Spectroscopy (Cancedda, 2024) to a reverse-engineered "average" token and uncover a latent subspace that is actively writing these frequent tokens into the embedding space. We refer to this subspace as the "edge spectrum" space, as it is spanned by the right singular vectors with the smallest and largest singular values, i.e., those positioned at the ends of the spectrum. We find that when the projection of the "average" token onto this subspace is truncated, the logits of these frequent tokens are significantly disrupted. Section 3 delves into the discovery of the edge spectrum, providing a detailed account of its identification.
Leveraging this insight, we show that this subspace can be effectively filtered out via a simple linear transformation, which we term EmbedFilter. This transformation is encoded within the parameters of the unembedding matrix and is readily accessible without further training. Our evaluations across a diverse suite of downstream tasks demonstrate that EmbedFilter acts as a potent post-processing enhancement, delivering steady incremental gains atop existing zero-shot text embedding baselines. EmbedFilter exhibits strong robustness across various backbone models and experimental configurations while incurring minimal computational overhead. Beyond performance gains, EmbedFilter naturally lends itself to dimensionality reduction as a distance-preserving transformation. This reduction lowers indexing overhead and speeds up retrieval, facilitating the practical deployment of LLMs.
To sum up, the contributions of this paper are threefold.
(1) We identify the LLM unembedding matrix as a previously overlooked feature lens to analyze the embedding space. We reveal that this matrix encodes a latent subspace corresponding to an "average" token and limits the embedding capabilities of LLMs. We provide a mechanistic interpretation that clarifies both the origins and impact of this phenomenon.
(2) We introduce EmbedFilter, a simple linear transformation that improves the zero-shot text embedding performance of LLMs. As an efficient post-processing technique, EmbedFilter achieves up to a 14.1% improvement on MTEB without any training overhead. Extensive evaluations across diverse experimental setups further demonstrate its broad applicability.
(3) We demonstrate that EmbedFilter acts as a distance-preserving transformation and enables embedding dimensionality reduction. This leads to faster retrieval and lower storage requirements, thereby facilitating the practical deployment of LLMs in large-scale text embedding applications.
2.Background
To establish the background for EmbedFilter, we first review the fundamentals of embedding extraction and introduce the mechanistic interpretability tools used throughout our analysis.
2.1.Text Embedding Paradigm
We first formulate the standard process of LLM-based text embedding extraction. Our objective is to transform a sentence $\bm{X}$ into a dense vector $\bm{h} \in \mathbb{R}^{d}$, such that the similarity between these vectors reflects their semantic similarity. Given an input sentence $\bm{X} = [x_1, x_2, \dots, x_L]$, its embedding $\bm{h}$ is obtained by passing $\bm{X}$ through an LLM backbone, followed by a pooling strategy $\operatorname{P}$:
$$\bm{h} \;=\; \operatorname{P}\left(\operatorname{LLM}\left([x_1, x_2, \dots, x_L]\right)\right),$$
where $\operatorname{P}$ aggregates the final-layer outputs from the LLM into a $d$-dimensional representation $\bm{h}$. Typically, the unembedding matrix is designed to map these hidden states back to the vocabulary space for token prediction. We contend that this module has been overlooked in traditional text embedding extraction and can be exploited to enhance embedding quality.
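To make the role of the pooling strategy $\operatorname{P}$ concrete, the following is a minimal sketch in Python with NumPy. The paper does not fix a particular pooling operator at this point, so the two options shown (last-token and mean pooling) and the function name `pool` are illustrative assumptions, and random numbers stand in for real LLM hidden states.

```python
import numpy as np

def pool(hidden_states: np.ndarray, strategy: str = "last") -> np.ndarray:
    """Aggregate final-layer hidden states of shape (L, d) into one d-dimensional vector.

    Both strategies are assumptions for illustration, not the paper's stated choice.
    """
    if strategy == "last":    # hidden state of the final token position
        return hidden_states[-1]
    if strategy == "mean":    # average over all token positions
        return hidden_states.mean(axis=0)
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Stand-in for LLM([x_1, ..., x_L]): random final-layer states with L = 5 tokens, d = 8.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))

h_last = pool(hidden_states, "last")   # shape (8,)
h_mean = pool(hidden_states, "mean")   # shape (8,)
```

Either vector plays the role of $\bm{h}$ in the formula above; in practice the hidden states would come from the final layer of the LLM backbone.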
2.2.Text Embeddings with Prompt Engineering
Many studies have explored improving the performance of LLMs on text embedding tasks through prompt engineering. Here, we provide a brief overview of two well-established baselines:
PromptEOL (Jiang et al., 2024) finds that a "one word limitation" template can help better condense semantics into the hidden state, thereby enhancing the representation of LLM-derived embeddings.
ECHO (Springer et al., 2025) suggests that causal attention in LLMs is a bottleneck, as earlier tokens cannot access future context. To mitigate this, they duplicate the input and extract embeddings from the second occurrence, incurring overhead from the increased input size.
More sophisticated prompt-engineering methods have been proposed (Lei et al., 2024; Thirukovalluru and Dhingra, 2025); however, these often necessitate intricate pipeline designs and incur substantial computational overhead. While our primary experiments focus on the aforementioned baselines, we provide a broader discussion and evaluation of these more complex strategies in our supplementary analysis.
2.3.Mechanistic Interpretability Tools
We provide an overview of two interpretability tools, Logit Lens (Belrose et al., 2023) and Logit Spectroscopy (Cancedda, 2024), which facilitate the identification of the edge spectrum subspace and inspire the design of EmbedFilter.
Logit Lens (Belrose et al., 2023) represents a cornerstone of mechanistic interpretability research. Its central premise is to project a model's intermediate representations directly into the vocabulary space. By analyzing the resulting changes in these logits, researchers can discern how specific intermediate activations shape the final predictions, thereby gaining insights into the model's internal processing logic. Building on this framework, Nie et al. (2025) apply the Logit Lens tool to text embeddings and find that these embeddings can align with certain keywords from the input texts.
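The projection described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: the unembedding matrix is random with unit-norm rows (real unembedding rows are not normalized), the helper name `logit_lens_topk` is hypothetical, and token indices stand in for actual vocabulary entries.

```python
import numpy as np

def logit_lens_topk(h: np.ndarray, W_U: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k tokens with the largest logits h @ W_U.T."""
    logits = h @ W_U.T                    # shape (vocab_size,)
    return np.argsort(logits)[::-1][:k]   # highest logit first

rng = np.random.default_rng(0)
vocab_size, d = 50, 8
W_U = rng.normal(size=(vocab_size, d))              # toy unembedding matrix
W_U /= np.linalg.norm(W_U, axis=1, keepdims=True)   # unit-norm rows (toy assumption)

h = 3.0 * W_U[7]                          # hidden state aligned with token 7's row
top_tokens = logit_lens_topk(h, W_U, k=5)
```

Because the rows are unit-norm here, the token whose row points in the same direction as `h` necessarily receives the largest logit, so `top_tokens[0]` is 7. With a real model, mapping the returned indices through the tokenizer gives the kind of token lists shown in the word clouds.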
To further dissect the semantic properties of different embedding subspaces, Logit Spectroscopy (Cancedda, 2024) extends Logit Lens by projecting intermediate representations onto spectral components of the model's weight matrices. Let $\bm{W}_{\mathcal{U}}$ be the unembedding matrix of the LLM. Its singular value decomposition can be formulated as:
$$\bm{W}_{\mathcal{U}} \;=\; \bm{U}\,\Sigma\,\bm{V}^{\top},$$
where $\bm{W}_{\mathcal{U}} \in \mathbb{R}^{|\mathcal{V}| \times d}$, with $d$ representing the hidden-state dimension and $|\mathcal{V}|$ the vocabulary size. For an arbitrary dimension $i \in \{0, \dots, d-1\}$, Logit Spectroscopy introduces a filter $\bm{\Psi}_{i}$ that removes the projection of $\bm{h}$ onto the $i$-th right singular vector $\bm{V}_{[i]}$ in $\bm{V}$. Formally, this transformation is defined as:
$$\bm{\Psi}_{i} \;=\; \bm{I} - \bm{V}_{[i]}\,\bm{V}_{[i]}^{\top}.$$
This operation facilitates the spectral analysis of an LLM's intermediate representations, enabling researchers to measure the contribution of hidden states within different spectral subspaces to the final output. Section 3 details how we leverage these tools to identify the "edge spectrum" subspace.
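As a small illustration of the spectral filter, the sketch below builds $\bm{\Psi}_i$ from the SVD of a toy unembedding matrix with NumPy. The matrix sizes, the random data, and the helper name `psi` are assumptions for illustration; hidden states are treated as row vectors, matching the $\bm{h}\,\bm{W}_{\mathcal{U}}^{\top}$ convention used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 50, 8
W_U = rng.normal(size=(vocab_size, d))     # toy unembedding matrix, |V| x d

# Thin SVD: W_U = U @ diag(S) @ Vt, with singular values in S sorted in descending order.
U, S, Vt = np.linalg.svd(W_U, full_matrices=False)
V = Vt.T                                   # columns of V are the right singular vectors

def psi(V: np.ndarray, i: int) -> np.ndarray:
    """Filter Psi_i = I - V_[i] V_[i]^T, which removes the i-th spectral direction."""
    v = V[:, i:i + 1]                      # column vector of shape (d, 1)
    return np.eye(V.shape[0]) - v @ v.T

h = rng.normal(size=d)                     # toy hidden state (row vector)
h_filtered = h @ psi(V, 0)                 # drop the component along V_[0]
```

After filtering, `h_filtered` has no component along `V[:, 0]`, while its components along all other right singular vectors are unchanged. Note that NumPy orders singular values from largest to smallest, so index 0 is the dominant direction under this convention.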
3.Discovery of Edge Spectrum Subspace
3.1.Motivation
In this section, we present the preliminary analyses that motivate the development of EmbedFilter. Our investigation is driven by an observed correlation between two key insights:
(1) Raw text embeddings from LLMs are typically anisotropic (Li et al., 2020; Su et al., 2021). These embeddings are concentrated in a narrow subspace, making them excessively similar to one another;
(2) LLM-derived embeddings often align with high-frequency tokens that carry little semantics.
These insights lead us to reasonably infer that the narrow subspace is responsible for encoding frequent tokens. Consequently, we seek to isolate this subspace and mitigate its impact, thereby alleviating the anisotropy problem in text embedding tasks. To accomplish this, we first reverse-engineer a "centroid" hidden state representing the "average" token. We then perform Logit Spectroscopy on this "average" token, revealing that the edge spectrum subspace drives the emergence of high-frequency tokens. We present the technical details of this discovery below.
3.2.Reverse-Engineering of the Average Token
We leverage the unembedding matrix, together with word frequencies from a training corpus, to reverse-engineer the "average" token.
3.2.1.Experimental Setup
We evaluate a diverse set of models, ranging from Qwen-2.5 (Team, 2024) (0.5B) to Mistral-v0.3-Instruct (Jiang et al., 2023) (7B) and Llama-3.1 Instruct (Grattafiori et al., 2024) (8B). By spanning multiple scales and model families, we aim to ensure the universality of our findings.
Since pretraining datasets for these LLMs are not disclosed, we approximate their true word frequency distribution $\bm{p}$ by sampling tokens from open-source corpora. Specifically, we select the RedPajama (Weber et al., 2024) dataset as our evaluation corpus. Parallel experiments on alternative corpora produce identical results. The resulting empirical statistics, denoted as $\hat{\bm{p}}$, serve as a robust proxy for the distribution $\bm{p}$ and are adopted throughout the following experiments.
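One simple way to obtain such empirical statistics is to count token ids over a tokenized sample of the corpus. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name is hypothetical, the token ids are toy data, and the additive smoothing (which keeps every probability positive so that a later logarithm stays finite) is a choice made here, not one stated in the paper.

```python
import numpy as np

def empirical_token_distribution(token_ids: np.ndarray, vocab_size: int,
                                 smoothing: float = 1.0) -> np.ndarray:
    """Estimate p_hat from sampled token ids, with additive smoothing (an assumption)."""
    counts = np.bincount(token_ids, minlength=vocab_size).astype(np.float64)
    counts += smoothing                    # avoid zero probabilities for unseen tokens
    return counts / counts.sum()

# Toy sample: token 0 appears three times, token 1 twice, token 2 once.
token_ids = np.array([0, 0, 0, 1, 1, 2])
p_hat = empirical_token_distribution(token_ids, vocab_size=5)
```

With the toy sample and a smoothing constant of 1, the smoothed counts are 4, 3, 2, 1, 1, so the most frequent token receives probability 4/11. On a real corpus, `token_ids` would be the concatenated output of the model's own tokenizer.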
3.2.2.Reverse-Engineering
We outline the practical steps for reverse-engineering the "average" token. For a standard inference step, the unembedding matrix is used to compute the probability distribution over the next token. Formally, this prediction step is given by:
$$\bm{q} \;=\; \operatorname{Softmax}\left(\bm{h}\,\bm{W}_{\mathcal{U}}^{\top}\right),$$
where the probability of an arbitrary token $i$ is given by:
$$\bm{q}_{i} \;=\; \exp(\bm{w}_{i}) \;\big/\; \sum_{j=1}^{|\mathcal{V}|} \exp(\bm{w}_{j}).$$
Given this, the logit $\bm{w}_{i}$ of the $i$-th token can be written as:
$$\bm{w}_{i} \;=\; \log(\bm{q}_{i}) \,+\, \log \sum_{j=1}^{|\mathcal{V}|} e^{\bm{w}_{j}},$$
where the second term is a shared bias across all logits, which we denote as $\bm{b}$. The logits for decoding $\bm{h}$ are then reformulated as:
$$\bm{h}\,\bm{W}_{\mathcal{U}}^{\top} \;=\; \log(\bm{q}) \,+\, \bm{b}.$$
By denoting the Moore–Penrose pseudo-inverse (Penrose, 1955) of $\bm{W}_{\mathcal{U}}^{\top}$ as $\bm{W}_{\mathcal{U}}^{+}$, we can further rewrite the preceding formula as:
$$\bm{h} \;=\; \left(\log(\bm{q}) \,+\, \bm{b}\right)\bm{W}_{\mathcal{U}}^{+}.$$
We substitute the observed word frequencies $\hat{\bm{p}}$ for $\bm{q}$ and interpret the resulting $\hat{\bm{h}}$ as the "average" token representation over the training corpus. Formally, the average token embedding is defined as:
$$\hat{\bm{h}} \;=\; \log(\hat{\bm{p}})\,\bm{W}_{\mathcal{U}}^{+},$$
where the bias term $\bm{b}$ is omitted for analytical simplicity, since it does not alter the fundamental spectral properties.
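The final formula can be sketched directly with NumPy's pseudo-inverse. Everything numeric here is a toy assumption: the unembedding matrix is random, and a Zipf-like distribution stands in for the measured frequencies $\hat{\bm{p}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 50, 8
W_U = rng.normal(size=(vocab_size, d))          # toy unembedding matrix, |V| x d

# Toy Zipf-like token frequencies standing in for the measured p_hat.
ranks = np.arange(1, vocab_size + 1)
p_hat = (1.0 / ranks) / (1.0 / ranks).sum()

# W_U^+ is the Moore-Penrose pseudo-inverse of W_U^T, so it has shape (|V|, d).
W_U_pinv = np.linalg.pinv(W_U.T)
h_avg = np.log(p_hat) @ W_U_pinv                # "average" token embedding, shape (d,)

# Decoding h_avg gives the least-squares approximation of log(p_hat) by logits.
logits = h_avg @ W_U.T
```

Since $d$ is much smaller than $|\mathcal{V}|$, the system $\bm{h}\,\bm{W}_{\mathcal{U}}^{\top} = \log(\hat{\bm{p}})$ is overdetermined; the pseudo-inverse returns the hidden state whose logits are closest to $\log(\hat{\bm{p}})$ in the least-squares sense, not an exact solution.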
3.2.3.Logit Spectroscopy into Average Token
Having established the theoretical foundation of Logit Spectroscopy, we now detail its application to the average token. For each dimension $i \in \{0, \dots, d-1\}$, we apply a filter $\bm{\Psi}_{i}$ to remove the projection of $\hat{\bm{h}}$ onto the subspace spanned by the $i$-th right singular vector, resulting in the perturbed representation $\widetilde{\bm{h}}^{(i)}$, defined as:
$$\widetilde{\bm{h}}^{(i)} \;=\; \hat{\bm{h}}\left(\bm{I} \,-\, \bm{V}_{[i]}\bm{V}_{[i]}^{\top}\right).$$
We analyze the logit shifts between $\hat{\bm{h}}$ and $\widetilde{\bm{h}}^{(i)}$ for the $k$ most frequent tokens in the training corpus. Let $\mathcal{V}^{+}$ denote this subset of frequent tokens, formally defined as $\mathcal{V}^{+} = \{\, j \mid j \in \operatorname{argtopk}(\hat{\bm{p}}) \,\}$. The impact of the filtering operation is then quantified by the cumulative logit differences across these tokens, which is given as:
$$\Delta\pi^{(i)} \;=\; \frac{\sum_{j \in \mathcal{V}^{+}} \left| \widetilde{w}^{(i)}_{j} - \hat{w}_{j} \right|}{\sum_{j \in \mathcal{V}^{+}} \left| \hat{w}_{j} \right|},$$
where $\hat{w}_{j}$ represents the original logit of the $j$-th token, and $\widetilde{w}^{(i)}_{j}$ denotes the logit after filtering out the subspace spanned by the $i$-th right singular vector of $\bm{W}_{\mathcal{U}}$. A higher value of $\Delta\pi^{(i)}$ indicates that the $i$-th singular subspace exerts a more pronounced influence on the representation of high-frequency tokens.
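The metric can be computed with a short loop over spectral directions. The sketch below follows the definition above on toy data; the random unembedding matrix, the Zipf-like frequencies, and the function name `delta_pi` are assumptions for illustration, so the resulting values say nothing about any real model.

```python
import numpy as np

def delta_pi(h_avg, W_U, V, p_hat, k):
    """Relative logit shift on the k most frequent tokens after removing each direction."""
    frequent = np.argsort(p_hat)[::-1][:k]       # indices of the k most frequent tokens
    w_hat = (h_avg @ W_U.T)[frequent]            # original logits on those tokens
    scores = np.empty(V.shape[1])
    for i in range(V.shape[1]):
        v = V[:, i]
        h_tilde = h_avg - (h_avg @ v) * v        # same as h_avg @ (I - v v^T)
        w_tilde = (h_tilde @ W_U.T)[frequent]    # logits after filtering direction i
        scores[i] = np.abs(w_tilde - w_hat).sum() / np.abs(w_hat).sum()
    return scores

# Toy setup: random unembedding matrix and Zipf-like frequencies.
rng = np.random.default_rng(0)
vocab_size, d, k = 50, 8, 10
W_U = rng.normal(size=(vocab_size, d))
ranks = np.arange(1, vocab_size + 1)
p_hat = (1.0 / ranks) / (1.0 / ranks).sum()
h_avg = np.log(p_hat) @ np.linalg.pinv(W_U.T)     # "average" token embedding
V = np.linalg.svd(W_U, full_matrices=False)[2].T  # right singular vectors as columns

scores = delta_pi(h_avg, W_U, V, p_hat, k)        # one value per spectral index i
```

Plotting `scores` against the index `i` gives the kind of curve the paper analyzes; note that `i = 0` corresponds to the largest singular value under NumPy's ordering.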
Figure 2. $\Delta\pi$ distribution for Qwen, Llama and Mistral.
Figure 2 presents the $\Delta\pi$ values when setting $k=100$. As shown, the $\Delta\pi$ values are significantly larger at the edges of the spectrum, suggesting that the subspaces corresponding to the edge spectrum of LLMs are primarily responsible for encoding high-frequency tokens. This specific spectral region is precisely what we aimed to identify. As demonstrated in the following sections, filtering out this edge spectrum not only suppresses the over-representation of "average" tokens but also enhances the quality of LLM-derived text embeddings. For comparison, Figure 4 visualizes the influence of different spectral subspaces on the representation of infrequent and randomly sampled tokens. Notably, the logit differences for infrequent and random tokens exhibit significantly lower sensitivity to the edge spectrum than those for frequent tokens.

Figure 3. Re-running the Logit Lens analysis from Section 1 with text embeddings refined by EmbedFilter. The top-6 tokens from Logit Lens are displayed, with colored entries indicating tokens that have literal connections with the input text. EmbedFilter suppresses the expression of frequent tokens and enhances the semantic richness of text embeddings.
4.Text embedding with EmbedFilter
Building on our preliminary insights, we propose EmbedFilter, a simple linear transformation to filter out the edge spectrum subspace. This section provides an overview of the EmbedFilter workflow. Additionally, we present a dimensionality reduction approach based on EmbedFilter to highlight its efficiency.
Table 1. Performance of EmbedFilter across MTEB tasks. $\tau$ controls dimensionality reduction, scaling the output dimensionality to $1/\tau$ of the original size. Bold text marks the best results within each setup. Parenthetical values indicate the performance gain of EmbedFilter compared to its baseline.

| Method | STS. | Class. | Cluster. | PairClass. | Rerank. | Retr. | Sum. | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|
| Num. Datasets (→) | 10 | 12 | 11 | 3 | 4 | 8 | 1 | 49 |
| Qwen2.5-0.5B | | | | | | | | |
| PromptEOL | 63.04 | 69.20 | 34.91 | 55.15 | 49.33 | 27.31 | 27.30 | 50.07 |
| + EmbFilter ($\tau=2$) | **69.48** | **70.32** | **39.20** | **64.72** | **51.28** | **34.73** | 27.12 | **54.57** (+9.0%) |
| + EmbFilter ($\tau=4$) | 68.57 | 68.92 | 38.24 | 64.54 | 50.62 | 32.85 | 27.67 | 53.47 (+6.8%) |
| + EmbFilter ($\tau=8$) | 68.03 | 66.07 | 35.50 | 63.57 | 49.70 | 29.82 | **28.37** | 51.43 (+2.7%) |
| ECHO | 63.98 | 64.86 | 30.16 | 55.54 | 42.80 | 18.15 | 22.78 | 46.03 |
| + EmbFilter ($\tau=2$) | **70.77** | **67.37** | **36.94** | **66.35** | **46.59** | **29.65** | 29.73 | **52.55** (+14.1%) |
| + EmbFilter ($\tau=4$) | 69.64 | 65.59 | 36.17 | 65.33 | 46.40 | 28.61 | **31.65** | 51.50 (+11.9%) |
| + EmbFilter ($\tau=8$) | 68.81 | 61.91 | 34.80 | 63.57 | 46.13 | 25.42 | 29.79 | 49.43 (+7.4%) |
| Llama-3.1-8B-Instruct | | | | | | | | |
| PromptEOL | 75.19 | 73.39 | 39.30 | 64.22 | 53.67 | 25.45 | 25.49 | 55.13 |
| + EmbFilter ($\tau=2$) | **76.66** | **73.78** | **40.67** | **66.64** | **54.68** | 29.69 | 27.39 | **56.79** (+3.0%) |
| + EmbFilter ($\tau=4$) | 76.63 | 73.73 | 40.57 | 66.63 | 54.65 | **29.86** | 27.51 | 56.78 (+3.0%) |
| + EmbFilter ($\tau=8$) | 76.33 | 73.10 | 40.32 | 66.41 | 54.41 | 29.70 | **27.93** | 56.46 (+2.4%) |
| ECHO | 70.43 | 68.80 | 38.89 | 66.98 | 49.26 | 30.14 | 25.41 | 53.52 |
| + EmbFilter ($\tau=2$) | **74.41** | **69.77** | **42.64** | **73.98** | **53.15** | **39.21** | 28.46 | **57.70** (+7.8%) |
| + EmbFilter ($\tau=4$) | 74.20 | 69.13 | 42.28 | 73.94 | 53.07 | 38.64 | **28.97** | 57.32 (+7.1%) |
| + EmbFilter ($\tau=8$) | 74.05 | 67.50 | 41.88 | 73.76 | 52.75 | 37.75 | 28.58 | 56.61 (+5.8%) |
| Mistral-7B-Instruct-v0.3 | | | | | | | | |
| PromptEOL | 64.15 | **71.26** | 33.40 | 58.51 | 48.10 | 20.91 | 24.72 | 49.47 |
| + EmbFilter ($\tau=2$) | 66.59 | 71.17 | 36.16 | 62.07 | 49.63 | 24.59 | 24.33 | 51.50 (+4.1%) |
| + EmbFilter ($\tau=4$) | 67.55 | 70.92 | 37.41 | 63.29 | 50.11 | **25.97** | 24.66 | 52.26 (+5.6%) |
| + EmbFilter ($\tau=8$) | **68.11** | 70.07 | **38.04** | **63.67** | **50.20** | 25.92 | **25.79** | **52.35** (+5.8%) |
| ECHO | 72.81 | 71.60 | 32.42 | 71.48 | 47.56 | 28.37 | 31.49 | 53.21 |
| + EmbFilter ($\tau=2$) | 74.66 | **71.79** | 36.14 | **74.96** | 51.66 | 35.03 | 31.23 | 56.10 (+5.4%) |
| + EmbFilter ($\tau=4$) | 74.85 | 71.05 | **37.07** | 74.91 | **51.87** | **35.49** | 31.14 | **56.25** (+5.7%) |
| + EmbFilter ($\tau=8$) | **74.86** | 70.00 | 36.92 | 74.29 | 51.71 | 34.91 | **31.56** | 55.82 (+4.9%) |
4.1.Methodology Formulation of EmbedFilter.
We introduce the Bulk Spectrum Transformation ($\bm{\Phi}_{\tau}$) to filter out the edge spectrum space of raw LLM-derived text embeddings. By excluding the right singular vectors associated with both the largest and smallest singular values, we construct $\bm{\Phi}_{\tau}$ from the remaining mid-range singular components. We hypothesize that this "bulk" of the spectrum suppresses the influence of non-semantic tokens, thereby enabling a more effective capture of core semantics within the embedding space. Formally, the matrix $\bm{\Phi}_{\tau}$ is defined as:
$$\bm{\Phi}_{\tau} \;=\; \bm{V}[l_{\tau}:r_{\tau}]\;\bm{V}[l_{\tau}:r_{\tau}]^{\top},$$
where $\tau$ is a predefined filtering ratio, with $l_{\tau}$ and $r_{\tau}$ denoting the start and end indices of the retained columns of $\bm{V}$. We use this transformation to post-process the existing embeddings $\{\bm{e}_{i}\}_{i=1}^{N}$ and map them into refined representations $\widetilde{\bm{e}}_{i}$ optimized for downstream tasks:
$$\widetilde{\bm{e}}_{i} \;=\; \bm{e}_{i}\,\bm{\Phi}_{\tau}^{\top}.$$
This transformation filters out the edge spectrum space while preserving the components in the bulk spectrum. Further implementation details can be found in our code repository. We then use EmbedFilter to refine the text embeddings and re-run the Logit Lens analysis, with the corresponding before-and-after comparisons presented in Figure 3.
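The transformation above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not spell out here how $l_{\tau}$ and $r_{\tau}$ are derived from $\tau$, so the centered window in the hypothetical helper `bulk_window` is purely an assumption, as are the function names and the random toy data.

```python
import numpy as np

def bulk_window(d: int, tau: int):
    """Hypothetical choice of (l_tau, r_tau): a centered window of d // tau directions."""
    width = d // tau
    l = (d - width) // 2
    return l, l + width

def embed_filter(E: np.ndarray, W_U: np.ndarray, tau: int):
    """Apply Phi_tau = V[l:r] V[l:r]^T to row-vector embeddings E of shape (N, d)."""
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    V = Vt.T                               # right singular vectors as columns
    l, r = bulk_window(V.shape[1], tau)
    V_bulk = V[:, l:r]                     # mid-range ("bulk") directions only
    Phi = V_bulk @ V_bulk.T                # symmetric projection matrix, d x d
    return E @ Phi.T, V_bulk

rng = np.random.default_rng(0)
vocab_size, d = 50, 8
W_U = rng.normal(size=(vocab_size, d))     # toy unembedding matrix
E = rng.normal(size=(3, d))                # three toy embeddings

E_refined, V_bulk = embed_filter(E, W_U, tau=2)
```

Because $\bm{\Phi}_{\tau}$ is a projection, applying it a second time leaves `E_refined` unchanged. The SVD depends only on the model weights, so in practice it would be computed once and reused for every embedding.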
4.2.Dimensionality Reduction
Moreover, we observe that text embeddings refined by EmbedFilter facilitate dimensionality reduction for free. Recall that $\bm{V}$ contains the right singular vectors of $\bm{W}_{\mathcal{U}}$. Since $\bm{V}$ is an orthogonal matrix, it defines a distance-preserving transformation. Consequently, for any $\bm{x}, \bm{y} \in \mathbb{R}^{d}$, we have the identity:
$$\left\| \bm{x}\,\bm{\Phi}_{\tau}^{\top} - \bm{y}\,\bm{\Phi}_{\tau}^{\top} \right\|_{2} \;=\; \left\| \bm{x}\,\bm{V}[l_{\tau}:r_{\tau}] - \bm{y}\,\bm{V}[l_{\tau}:r_{\tau}] \right\|_{2}. \tag{1}$$
Given the property in Equation 1, we can replace $\bm{\Phi}_{\tau}^{\top}$ with $\bm{V}[l_{\tau}:r_{\tau}]$, which causes no theoretical difference in similarity measurement. For readers unfamiliar with these properties, we also provide a simple proof of Equation 1 in Appendix B.
By invoking this identity, we reduce the size of each text embedding from $d$ to $r_{\tau} - l_{\tau}$ dimensions. This translates to lower index storage overhead and faster retrieval, as it reduces both memory bandwidth bottlenecks and distance computation complexity during search. Our experimental results in Section 5 demonstrate that this approach achieves significant dimensionality reduction while maintaining or even exceeding downstream task performance, thereby improving efficiency and effectiveness simultaneously.
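The identity in Equation 1 is easy to check numerically. In the sketch below, a random orthogonal matrix obtained from a QR decomposition stands in for $\bm{V}$, and the window of retained columns is an arbitrary choice; both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 8, 4

# Any matrix with orthonormal columns works for this check; QR provides one.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
V_bulk = Q[:, 2:2 + width]                 # stand-in for V[l_tau : r_tau], shape (d, width)
Phi = V_bulk @ V_bulk.T                    # full-size filter, shape (d, d)

x, y = rng.normal(size=d), rng.normal(size=d)

dist_full = np.linalg.norm(x @ Phi.T - y @ Phi.T)         # distance in d dimensions
dist_reduced = np.linalg.norm(x @ V_bulk - y @ V_bulk)    # distance in `width` dimensions

dot_full = (x @ Phi.T) @ (y @ Phi.T)                      # inner product in d dimensions
dot_reduced = (x @ V_bulk) @ (y @ V_bulk)                 # inner product after reduction
```

The two distances agree up to floating-point error, and so do the two inner products, which means cosine similarities are also unaffected. The reduced vectors `x @ V_bulk` can therefore be stored and searched in place of the full-size filtered embeddings.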
5.Experiment
5.1.General Setup.
We evaluate EmbedFilter's effectiveness on the MTEB benchmark (Muennighoff et al., 2023), which includes standard downstream applications for text embeddings such as Semantic Textual Similarity (STS), Classification (Class.), Clustering (Cluster.), and Retrieval (Retr.). We build our evaluation framework upon the official MTEB implementation and report the standard metrics for each task. Due to limited computational resources, we evaluate a subset of the retrieval tasks, following the protocols in (BehnamGhader et al., 2024; Li and Zhou, 2025). Detailed descriptions of the experimental configurations and subset selection can be found in Appendix A. We evaluate EmbedFilter across three backbone LLMs (Qwen, Llama, and Mistral), ensuring comprehensive coverage of mainstream architectures and model scales.
5.2.Main Results on MTEB.
Table 1 presents the main experimental results of EmbedFilter on MTEB, configured with both PromptEOL and ECHO. Specifically, we analyze EmbedFilter's performance with different filtering ratios to assess its sensitivity. We have the following observations:
Table 2. Performance of EmbedFilter on MTEB via MetaEOL prompting.
Table 3. Performance of EmbedFilter on STS tasks under the GenEOL framework.
(1) EmbedFilter demonstrates notable improvements across all experimental setups, providing strong evidence of its effectiveness and robustness. Specifically, EmbedFilter delivers remarkable enhancements over the baselines, achieving up to a 14.1% increase in MTEB overall performance. These performance gains are maintained even when the output embedding size is reduced to only $1/8$ of its original dimension. Furthermore, EmbedFilter consistently achieves superior overall performance across all evaluated setups, whereas the prompt-engineering methods exhibit performance fluctuations. This underscores the generalization capability of EmbedFilter and highlights its potential for integration with a broader spectrum of LLMs.
(2) EmbedFilter introduces only a lightweight linear transformation module, ensuring negligible overhead during the post-processing of large-scale text embeddings. Additional experimental results in Tables 2 and 3 demonstrate that EmbedFilter remains highly effective even when integrated into sophisticated prompt-engineering pipelines, such as MetaEOL (Lei et al., 2024) and GenEOL (Thirukovalluru and Dhingra, 2025). Unlike these complex frameworks, which require iterative calls to powerful commercial LLMs or the aggregation of multiple embeddings for a single sentence, EmbedFilter avoids the heavy computational overhead of such extraction pipelines, leading to superior downstream performance with higher efficiency.
5.3.The Effect of Filtering Ratio $\tau$
As mentioned above, we introduce a hyperparameter $\tau$ to represent the filtering ratio in EmbedFilter. Consequently, the dimensionality of text embeddings is reduced to $1/\tau$ of the original size. This reduction is critical, as it scales down the index storage to $1/\tau$ of its previous footprint and theoretically yields a $\tau\times$ speedup in similarity computation. A larger value of $\tau$ indicates lower memory usage and faster retrieval, which is especially beneficial in real-world applications. Based on this, we analyze the impact of $\tau$ on the performance of EmbedFilter. As shown in Table 1, EmbedFilter consistently delivers improvements across different choices of $\tau$. Remarkably, it retains competitive, and in some cases superior, performance on MTEB tasks, even at a high filtering ratio of $\tau=8$.
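A quick back-of-the-envelope calculation illustrates the storage scaling. The numbers below are assumptions for illustration: a flat index of one million vectors stored as 32-bit floats, with an original dimensionality of 4096 (the hidden size of Llama-3.1-8B); the helper name is hypothetical and index metadata is ignored.

```python
def index_size_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of a flat float32 vector index, ignoring any index metadata."""
    return num_vectors * dim * bytes_per_value

d = 4096                 # assumed original embedding dimensionality
num_vectors = 1_000_000  # assumed corpus size

for tau in (1, 2, 4, 8):
    reduced_dim = d // tau
    size_gb = index_size_bytes(num_vectors, reduced_dim) / 1e9
    print(f"tau={tau}: dim={reduced_dim}, index ~ {size_gb:.2f} GB")
```

Under these assumptions the index shrinks from roughly 16.38 GB at $\tau=1$ to about 2.05 GB at $\tau=8$, and the number of multiply-adds per similarity computation drops by the same factor.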
Large language models typically have larger hidden sizes, leading to increased storage and computational costs when deployed as embedding models. By incorporating EmbedFilter, LLMs can attain improved downstream performance with smaller representation dimensions. We present the dimensionality reduction performance of Llama-3.1-8B-Instruct with EmbedFilter in Table 4. With the aid of EmbedFilter, zero-shot LLMs can outperform established, well-trained baselines from the pre-LLM era, such as SimCSE (Gao et al., 2021) and coCondenser (Gao and Callan, 2022), while using smaller representation dimensions. This advancement enables the direct deployment of LLMs as embedding models in low-resource scenarios.
Table 4. Dimensionality reduction performance of Llama with EmbedFilter on MTEB.
Table 5. Ablation studies of the filtering strategies. Best results are in bold.
Table 6. MTEB results for EmbedFilter and whitening. Best results are highlighted in bold.
5.4. Ablation Studies of Filtering Strategies
We evaluate various configurations of our filtering strategies to verify the effectiveness of the EmbedFilter design. Specifically, we conduct a detailed ablation analysis using Qwen2.5-0.5B with PromptEOL and a dimensionality reduction ratio of $\tau=2$. The results across these different experimental setups are reported in Table 5. We can draw the following conclusions:
(1) The improvement of EmbedFilter does not stem from a simple reduction in the dimensionality of text embeddings. In configuration 1, we truncate the original text embeddings to the first half of their dimensions, following the Matryoshka setup (Kusupati et al., 2022). In configuration 2, we randomly choose half of the dimensions from the original $d$-dimensional vector to form the reduced embeddings. Configurations 1 and 2 have fewer vector dimensions but still underperform the vanilla PromptEOL. Therefore, we contend that the improvements brought by EmbedFilter are not merely due to the reduction in dimensionality.
(2) EmbedFilter provides the most effective strategy for subspace filtering. Our comparisons include configurations 3 through 5, where we selectively filter the right singular subspaces associated with the largest (Dominant), smallest (Secondary), and intermediate (Bulk) singular values, respectively. Compared to these variants, EmbedFilter achieves the best downstream performance. Notably, configuration 5, the inverse operation of EmbedFilter, obtains the poorest results. Moreover, we find that configuration 4 significantly outperforms 5. This finding is in line with the $\Delta\pi$ distribution shown in Figure 2, where the secondary subspace exhibits a greater tendency to encode frequent tokens than the dominant subspace. We leave the exploration of optimal strategies for filtering the asymmetric edge spectrum subspace to future work.
(3) EmbedFilter is remarkably effective, nearly reaching the theoretical upper bound of our framework's potential. In configuration 6, we identify the singular vectors with the largest $\Delta\pi^{(i)}$ based on our analysis in Section 3 and filter out the corresponding subspace. We regard this configuration as the theoretical upper bound of EmbedFilter's capability. As shown in Table 5, EmbedFilter performs competitively with configuration 6 while requiring no task-specific calibration and being significantly simpler to implement.
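The ablation configurations 1 through 5 and the EmbedFilter variant can be sketched in NumPy as below. This is an illustrative reading of the text, not the authors' released code: the function names are hypothetical, the random matrices stand in for a real unembedding matrix and real text embeddings, and the symmetric placement of the kept bulk window is an assumption (the paper chooses $l_\tau$ and $r_\tau$ itself). Configuration 6 is omitted because it needs the $\Delta\pi^{(i)}$ statistics from Section 3.

```python
import numpy as np


def right_singular_basis(unembed: np.ndarray) -> np.ndarray:
    """Right singular vectors of a (vocab x d) matrix as columns,
    ordered by descending singular value. Assumes vocab >= d."""
    _, _, vt = np.linalg.svd(unembed, full_matrices=False)
    return vt.T


def ablation_variants(emb: np.ndarray, unembed: np.ndarray, tau: int = 2, seed: int = 0) -> dict:
    """Build reduced embeddings for each filtering strategy (hypothetical helper)."""
    d = emb.shape[1]
    k = d // tau                      # number of dimensions kept
    V = right_singular_basis(unembed)
    edge = (d - k) // 2               # ASSUMPTION: bulk window centred in the spectrum
    bulk = np.arange(edge, edge + k)
    edges = np.setdiff1d(np.arange(d), bulk)
    rng = np.random.default_rng(seed)
    rand_idx = np.sort(rng.choice(d, size=k, replace=False))
    return {
        "cfg1_truncate": emb[:, :k],                  # keep the first k raw dimensions
        "cfg2_random": emb[:, rand_idx],              # keep k random raw dimensions
        "cfg3_filter_dominant": emb @ V[:, d - k:],   # drop largest singular directions
        "cfg4_filter_secondary": emb @ V[:, :k],      # drop smallest singular directions
        "cfg5_filter_bulk": emb @ V[:, edges],        # drop the middle, keep both edges
        "embedfilter_keep_bulk": emb @ V[:, bulk],    # drop both edges, keep the middle
    }


# Toy stand-ins: 64-token vocabulary, hidden size 16, five embeddings.
rng = np.random.default_rng(0)
unembed = rng.standard_normal((64, 16))
emb = rng.standard_normal((5, 16))
variants = ablation_variants(emb, unembed, tau=2)
for name, reduced in variants.items():
    print(name, reduced.shape)
```

With $\tau=2$ every variant keeps half of the dimensions, which matches the ablation setting; for other values of $\tau$, `cfg5_filter_bulk` would keep $d - d/\tau$ dimensions in this sketch.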
5.5. Comparison between EmbedFilter and Embedding Calibration Baselines
We also compare EmbedFilter with established embedding calibration baselines. These methods typically derive text embeddings from a calibration dataset and propose improvements based on the resulting statistical properties. A representative baseline is BERT-whitening (Su et al., 2021), which addresses the anisotropy issue by applying a whitening operation to the text embeddings. Notably, BERT-whitening also enables dimensionality reduction as a consequence.
Given this, we compare EmbedFilter and whitening on Qwen and set $\tau=2$. We follow the experimental setup of (Su et al., 2021) and report the results obtained with supervision from the NLI dataset (Bowman et al., 2015). The MTEB results are presented in Table 6. While whitening improves performance, EmbedFilter still outperforms it without relying on any calibration data. We argue that the unembedding matrix of LLMs captures valuable statistical features during the pretraining phase that have been previously overlooked. We did not include this method as a baseline in Table 1, as its reliance on calibration data would lead to an unfair comparison with EmbedFilter.
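For readers unfamiliar with the whitening baseline, the sketch below follows the commonly described recipe: estimate the mean and covariance of calibration embeddings, take the SVD of the covariance, and keep the leading whitened directions. The function names are hypothetical and random vectors stand in for embeddings of a calibration corpus such as NLI; it is not the exact configuration used in the comparison.

```python
import numpy as np


def fit_whitening(calib: np.ndarray, k: int):
    """Fit a whitening map on calibration embeddings (n x d), keeping k dimensions."""
    mu = calib.mean(axis=0, keepdims=True)
    cov = np.cov(calib - mu, rowvar=False)      # (d x d) covariance estimate
    u, s, _ = np.linalg.svd(cov)                # cov = u @ diag(s) @ u.T
    W = u @ np.diag(1.0 / np.sqrt(s))           # rescale each direction to unit variance
    return mu, W[:, :k]                         # leading k directions only


def apply_whitening(emb: np.ndarray, mu: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Centre and whiten embeddings with a previously fitted map."""
    return (emb - mu) @ W


# Toy stand-ins: 200 calibration vectors of dimension 8, reduced with tau = 2.
rng = np.random.default_rng(0)
calib = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 8))
mu, W = fit_whitening(calib, k=4)
whitened = apply_whitening(calib, mu, W)
# On the calibration set itself the covariance becomes (numerically) the identity.
print(np.round(np.cov(whitened, rowvar=False), 3))
```

The key contrast with EmbedFilter is visible in the signature: `fit_whitening` needs calibration embeddings, whereas the filtering transform is derived from the unembedding matrix alone.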
While EmbedFilter is primarily heuristic, we also offer a whitening perspective to aid understanding. In effect, it can be interpreted as a whitening-like operation within the bulk spectral space:
$$\widetilde{\bm{e}}_{i} \;=\; \bm{e}_{i}\,\bm{\Phi}_{r}^{\top} \;=\; \sum_{j=l_{\tau}}^{r_{\tau}} \alpha_{j}\,\bm{v}_{j}, \qquad \text{where}\;\, \alpha_{j} \;=\; \operatorname{proj}_{\bm{v}_{j}} \bm{e}_{i}.$$
Text embeddings exhibit more uniform projections onto directions associated with mid-range singular values, providing a relatively isotropic subspace for free. We leave a deeper investigation into the underlying mechanisms of this phenomenon to future work, and we hope this perspective will inspire readers and inform future advancements in text embedding training.
6. Conclusion
In this paper, we investigate the suboptimal zero-shot performance of LLMs on text embedding tasks and provide a mechanistic interpretation. Through an analysis of the model’s unembedding matrix, we discover the edge spectrum space, which is responsible for encoding high-frequency tokens into the embedding space. Motivated by this finding, we introduce EmbedFilter, a simple linear transformation to filter out this spectrum space. Our experiments across multiple LLM backbones demonstrate that applying EmbedFilter leads to superior zero-shot improvements on text embedding tasks. Crucially, we also find that this filtering design implicitly reduces the effective dimensionality of the embeddings, thereby lowering index storage overhead and accelerating retrieval. We hope our findings provide insights and inspire more principled designs to improve text embeddings training.
Acknowledgment
This work is supported by Lenovo Group. We thank Ang Lv for his writing suggestions and guidance during the rebuttal phase. We are also grateful to Yuhan Liu and Yankai Lin for providing computational resources and API access. Additionally, we sincerely acknowledge the anonymous KDD reviewers for their constructive comments and questions, which have greatly improved this work.
References
- P. BehnamGhader, V. Adlakha, M. Mosbach, D. Bahdanau, N. Chapados, and S. Reddy (2024) LLM2Vec: large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961.
- N. Belrose, Z. Furman, L. Smith, J. Wu, B. Ge, A. Trakhtenberg, M. Shah, and J. Gurney (2023) Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.
- A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, and M. Hagen (2020) Overview of touché 2020: argument retrieval: extended abstract. In Experimental IR Meets Multilinguality, Multimodality, and Interaction: 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22–25, 2020, Proceedings, Berlin, Heidelberg, pp. 384–395. ISBN 978-3-030-58218-0.
- V. Boteva, D. Gholipour, A. Sokolov, and S. Riezler (2016) A full-text learning to rank dataset for medical information retrieval. In Proceedings of the European Conference on Information Retrieval (ECIR).
- S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, L. Màrquez, C. Callison-Burch, and J. Su (Eds.), Lisbon, Portugal, pp. 632–642.
- N. Cancedda (2024) Spectral filters, dark signals, and attention sinks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 4792–4808.
- A. Cohan, S. Feldman, I. Beltagy, D. Downey, and D. S. Weld (2020) SPECTER: document-level representation learning using citation-informed transformers. In ACL.
- DeepSeek-AI (2026) DeepSeek-v4: towards highly efficient million-token context intelligence.
- K. Ethayarajh (2019) How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 55–65.
- L. Gao and J. Callan (2022) Unsupervised corpus aware language model pre-training for dense passage retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 2843–2853.
- T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), Online and Punta Cana, Dominican Republic, pp. 6894–6910.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023) Mistral 7b. arXiv preprint arXiv:2310.06825.
- T. Jiang, S. Huang, Z. Luan, D. Wang, and F. Zhuang (2024) Scaling sentence embeddings with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 3182–3196.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- A. Kusupati, G. Bhatt, A. Rege, M. Wallingford, A. Sinha, V. Ramanujan, W. Howard-Snyder, K. Chen, S. Kakade, P. Jain, et al. (2022) Matryoshka representation learning. Advances in Neural Information Processing Systems 35, pp. 30233–30249.
- Y. Lei, D. Wu, T. Zhou, T. Shen, Y. Cao, C. Tao, and A. Yates (2024) Meta-task prompting elicits embeddings from large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 10141–10157.
- B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li (2020) On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 9119–9130.
- Z. Li and T. Zhou (2025) Your mixture-of-experts LLM is secretly an embedding model for free. In The Thirteenth International Conference on Learning Representations.
- A. Lv, Y. Chen, K. Zhang, Y. Wang, L. Liu, J. Wen, J. Xie, and R. Yan (2024) Interpreting key mechanisms of factual recall in transformer-based language models. arXiv preprint arXiv:2403.19521.
- M. Maia, S. Handschuh, A. Freitas, B. Davis, R. McDermott, M. Zarrouk, and A. Balahur (2018) WWW'18 open challenge: financial opinion mining and question answering. pp. 1941–1942.
- N. Muennighoff, N. Tazi, L. Magne, and N. Reimers (2023) MTEB: massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, A. Vlachos and I. Augenstein (Eds.), Dubrovnik, Croatia, pp. 2014–2037.
- Z. Nie, R. Zhang, and Z. Wu (2025) A text is worth several tokens: text embedding from LLMs secretly aligns well with the key tokens. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 7683–7694. ISBN 979-8-89176-251-0.
- R. Penrose (1955) A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society 51(3), pp. 406–413.
- K. Roberts, T. Alam, S. Bedrick, D. Demner-Fushman, K. Lo, I. Soboroff, E. Voorhees, L. L. Wang, and W. R. Hersh (2021) Searching for scientific evidence in a pandemic: an overview of trec-covid. arXiv preprint arXiv:2104.09632.
- J. M. Springer, S. Kotha, D. Fried, G. Neubig, and A. Raghunathan (2025) Repetition improves language model embeddings. In The Thirteenth International Conference on Learning Representations.
- J. Su, J. Cao, W. Liu, and Y. Ou (2021) Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316.
- Q. Team (2024) Qwen2.5: a party of foundation models.
- N. Thakur, N. Reimers, A. Rücklé, A. Srivastava, and I. Gurevych (2021) BEIR: a heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- R. Thirukovalluru and B. Dhingra (2025) GenEOL: harnessing the generative power of LLMs for training-free sentence embeddings. In Findings of the Association for Computational Linguistics: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 2295–2308. ISBN 979-8-89176-195-7.
- H. Wachsmuth, S. Syed, and B. Stein (2018) Retrieval of the best counterargument without prior topic knowledge. In ACL.
- D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, and H. Hajishirzi (2020) Fact or fiction: verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7534–7550.
- M. Weber, D. Fu, Q. Anthony, Y. Oren, S. Adams, A. Alexandrov, X. Lyu, H. Nguyen, X. Yao, V. Adams, B. Athiwaratkun, R. Chalamala, K. Chen, M. Ryabinin, T. Dao, P. Liang, C. Ré, I. Rish, and C. Zhang (2024) RedPajama: an open dataset for training large language models. arXiv preprint arXiv:2411.12372.
Appendix A. Details of the Main Experimental Setup
In this section, we provide additional details about the experimental setups discussed in Section 5. We evaluate all tasks from MTEB, including semantic textual similarity (STS.), classification (Class.), clustering (Cluster.), pair classification (PairClass.), re-ranking (Rerank.), retrieval (Retr.), and summarization (Sum.). Due to limited computational resources, we evaluate a subset of the retrieval tasks, consisting of eight datasets (Muennighoff et al., 2023): SciFact (Wadden et al., 2020), ArguAna (Wachsmuth et al., 2018), NFCorpus (Boteva et al., 2016), FiQA2018 (Maia et al., 2018), QuoraRetrieval (Thakur et al., 2021), SCIDOCS (Cohan et al., 2020), Touche2020 (Bondarenko et al., 2020), and TRECCOVID (Roberts et al., 2021). Finally, we use the metrics recommended by MTEB, shown in Table 7, where Spearman's correlation is calculated on cosine similarity. For EmbedFilter used on Mistral-7B-Instruct-V0.3, we shift the whole index window until $l_{\tau}=128$. We provide the actual prompts used for PromptEOL and ECHO across different models below; "{text}" denotes the sentence to be embedded.
PromptEOL
- Qwen: Summarize the sentence: "{text}" in one word:"
- Llama: Summarize the sentence: "{text}" in one word:"
- Mistral: This sentence: "{text}" means [MASK]
ECHO
- Qwen: Rewrite the following paragraph: {text}. The rewritten paragraph: {text}
- Llama: Rewrite the following paragraph: {text}. The rewritten paragraph: {text}
- Mistral: Rewrite the following paragraph: {text}. The rewritten paragraph: {text}
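As a small illustration of how the prompt templates listed above might be filled in, consider the sketch below. The template strings are copied from this appendix; the dictionary layout and the `build_prompt` helper are hypothetical conveniences, not part of the released code.

```python
# Template strings copied from the appendix; keys and helper are illustrative.
PROMPTEOL = {
    "qwen": 'Summarize the sentence: "{text}" in one word:"',
    "llama": 'Summarize the sentence: "{text}" in one word:"',
    "mistral": 'This sentence: "{text}" means [MASK]',
}
ECHO_TEMPLATE = "Rewrite the following paragraph: {text}. The rewritten paragraph: {text}"
ECHO = {"qwen": ECHO_TEMPLATE, "llama": ECHO_TEMPLATE, "mistral": ECHO_TEMPLATE}


def build_prompt(method: str, model: str, text: str) -> str:
    """Fill the {text} slot of the chosen template (hypothetical helper)."""
    templates = {"prompteol": PROMPTEOL, "echo": ECHO}
    return templates[method][model].format(text=text)


print(build_prompt("prompteol", "qwen", "A cat sat on the mat"))
print(build_prompt("echo", "mistral", "A cat sat on the mat"))
```

Note that the ECHO template repeats the input, so `{text}` is substituted twice.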
Table 7. Evaluation metrics used for MTEB tasks.
Appendix B. Equivalence Transformation Proof
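The STS metric named in this appendix, Spearman's correlation calculated on cosine similarity, can be sketched as follows. This is a simplified stand-in for the MTEB evaluation code: the function names are hypothetical, the toy vectors and gold scores are made up, and the rank computation assumes there are no tied values.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n x d) arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)


def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x)).astype(float)
    rank_y = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Toy sentence-pair embeddings and made-up gold similarity scores.
left = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
right = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
gold = np.array([5.0, 3.0, 0.5])

pred = cosine_sim(left, right)
# The predicted order matches the gold order, so rho is 1 up to floating point.
print(spearman(pred, gold))
```

Because Spearman's correlation depends only on ranks, any monotone rescaling of the cosine scores leaves the metric unchanged.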
In the main text, we define the projection matrix as:
$$\bm{\Phi}_{\tau} = \bm{V}[l_{\tau}:r_{\tau}]\,\bm{V}[l_{\tau}:r_{\tau}]^{\top}.$$
Let $\bm{V}_{\tau} = \bm{V}[l_{\tau}:r_{\tau}]$ for simplicity. We seek to prove the identity:
$$\|\bm{x}\,\bm{\Phi}_{\tau}^{\top} - \bm{y}\,\bm{\Phi}_{\tau}^{\top}\|_{2} = \|\bm{x}\,\bm{V}_{\tau} - \bm{y}\,\bm{V}_{\tau}\|_{2}.$$
Let $\bm{z} = \bm{x} - \bm{y}$. The left-hand side can be written as:
$$\|\bm{x}\,\bm{\Phi}_{\tau}^{\top} - \bm{y}\,\bm{\Phi}_{\tau}^{\top}\|_{2} \;=\; \|\bm{z}\,\bm{\Phi}_{\tau}^{\top}\|_{2} \;=\; \|\bm{z}\,\bm{V}_{\tau}\,\bm{V}_{\tau}^{\top}\|_{2}.$$
Since the columns of $\bm{V}_{\tau}$ are orthonormal, $\bm{V}_{\tau}^{\top}\bm{V}_{\tau}$ is the identity (whereas $\bm{V}_{\tau}\bm{V}_{\tau}^{\top}$ is only a projection), and thus we have:
$$\|\bm{z}\,\bm{V}_{\tau}\,\bm{V}_{\tau}^{\top}\|_{2}^{2} \;=\; \bm{z}\,\bm{V}_{\tau}\,(\bm{V}_{\tau}^{\top}\bm{V}_{\tau})\,\bm{V}_{\tau}^{\top}\bm{z}^{\top} \;=\; \bm{z}\,\bm{V}_{\tau}\,\bm{V}_{\tau}^{\top}\bm{z}^{\top} \;=\; \|\bm{z}\,\bm{V}_{\tau}\|_{2}^{2} \;=\; \|\bm{x}\,\bm{V}_{\tau} - \bm{y}\,\bm{V}_{\tau}\|_{2}^{2}.$$
Taking square roots completes the proof of the identity in equation 1.
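The identity stated at the start of this appendix, that distances after applying $\bm{\Phi}_{\tau}$ equal distances between the reduced coordinates $\bm{x}\bm{V}_{\tau}$ and $\bm{y}\bm{V}_{\tau}$, can be checked numerically. The sketch below is only a sanity check under stated assumptions: a random orthonormal basis from a QR decomposition stands in for the right singular vectors, and the window bounds `l` and `r` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l, r = 12, 3, 9                      # arbitrary dimension and window bounds

# Random orthonormal basis standing in for the right singular vectors V.
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
V_tau = V[:, l:r]                       # columns l..r-1 of V, shape (d, r - l)
Phi = V_tau @ V_tau.T                   # d x d projection matrix

x = rng.standard_normal(d)
y = rng.standard_normal(d)

lhs = np.linalg.norm(x @ Phi.T - y @ Phi.T)   # distance in the full d-dim space
rhs = np.linalg.norm(x @ V_tau - y @ V_tau)   # distance in the reduced space

print(np.isclose(lhs, rhs))                           # True
print(np.allclose(V_tau.T @ V_tau, np.eye(r - l)))    # True: orthonormal columns
```

This is why the reduced $(r_{\tau} - l_{\tau})$-dimensional vectors can be stored in the index instead of the full $d$-dimensional projections without changing Euclidean distances.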
Figure 4. $\Delta\pi$ distribution for high-frequency, low-frequency, and randomly sampled tokens on the Qwen model.