Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Hugging Face Daily Papers 06/05/26, 12:00 AM Papers

Summary

The paper identifies that LLM text embeddings overly express high-frequency uninformative tokens and proposes EmbedFilter, a linear transformation that filters out this subspace to improve semantic representations and enable dimensionality reduction.

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
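The observation about projecting text embeddings onto the vocabulary space can be illustrated with a small sketch. This is not the authors' code: the vocabulary, the matrix `W_U`, the vector `text_embedding`, and the helper `top_tokens` are all hypothetical toy values chosen so that the embedding scores highest on frequent tokens. In a real model, `W_U` would be the LLM's unembedding (output projection) matrix and the embedding would come from a pooled hidden state.

```python
import numpy as np

def top_tokens(embedding, unembedding, vocab, k=3):
    """Return the k vocabulary tokens with the highest logit for this embedding."""
    logits = unembedding @ embedding      # one score per vocabulary entry
    order = np.argsort(-logits)[:k]       # indices of the k largest scores
    return [vocab[i] for i in order]

# Hypothetical toy vocabulary and unembedding matrix (vocab_size x hidden_dim).
vocab = ["the", ".", "of", "neuron", "guitar"]
W_U = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [1.8, 0.2, 0.0, 0.0],
    [1.5, 0.0, 0.1, 0.0],
    [0.1, 0.0, 2.0, 0.0],
    [0.1, 0.0, 0.0, 2.0],
])

# Hypothetical text embedding whose largest component lies along dimension 0,
# the direction shared by the frequent tokens in this toy setup.
text_embedding = np.array([1.0, 0.1, 0.3, 0.2])

print(top_tokens(text_embedding, W_U, vocab))  # ['the', '.', 'of']
```

In this toy case the content-bearing tokens ("neuron", "guitar") rank below the function words even though the embedding has nonzero components along their directions, which is the kind of pattern the abstract describes.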


Cached at: 06/08/26, 07:14 AM

Source: https://huggingface.co/papers/2606.07502

Abstract

Text embeddings from large language models are enhanced by EmbedFilter, a linear transformation that reduces the influence of high-frequency tokens and improves semantic representations while enabling dimensionality reduction.

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model’s ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
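The idea of filtering a subspace with a linear transformation, and getting a smaller embedding as a side effect, can be sketched as follows. The abstract does not say how EmbedFilter identifies the subspace, so this sketch makes an explicit assumption: it treats the top-k right singular directions of the unembedding matrix as the subspace to remove. The function `build_filter`, the value of `k`, and all toy matrices and vectors are hypothetical; the authors' actual implementation is in the repository linked above and may differ.

```python
import numpy as np

def build_filter(unembedding, k):
    """Return a (d - k, d) linear map that drops the top-k right singular
    directions of the unembedding matrix (an assumed choice of subspace)."""
    _, _, vt = np.linalg.svd(unembedding, full_matrices=True)
    return vt[k:]  # orthonormal rows spanning the complement of the removed subspace

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy unembedding matrix (vocab_size x hidden_dim). The first three
# rows stand in for frequent tokens that share a dominant direction.
W_U = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [1.8, 0.2, 0.0, 0.0],
    [1.5, 0.0, 0.1, 0.0],
    [0.1, 0.0, 2.0, 0.0],
    [0.1, 0.0, 0.0, 2.0],
])

F = build_filter(W_U, k=1)  # maps 4-dimensional embeddings to 3 dimensions

# Two hypothetical embeddings that differ in content but share the dominant direction.
doc_a = np.array([1.0, 0.0, 0.3, 0.0])
doc_b = np.array([1.0, 0.0, 0.0, 0.3])

before = cosine(doc_a, doc_b)
after = cosine(F @ doc_a, F @ doc_b)
print(F.shape, round(before, 3), round(after, 3))
```

Because the rows of `F` are orthonormal, the filtered vectors keep their geometry inside the retained subspace while storing `d - k` numbers instead of `d`. In this toy setup the two documents look nearly identical before filtering only because of the shared direction, and their similarity drops once it is removed.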

Similar Articles

@vintcessun: Turns out LLM text embeddings are hijacked by high-frequency tokens (periods, articles)! The unembedding matrix implicitly defines a low-rank subspace dominated by these uninformative expressions. This is the root cause of LLMs' poor performance as universal embeddings, and the contamination is subtle. EmbedFilter…

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

Your Embedding Model is SMARTer Than You Think

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

@mixedbreadai: By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows. But they contain t…