Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

Hugging Face Daily Papers

Summary

The paper identifies that LLM text embeddings overly express high-frequency uninformative tokens and proposes EmbedFilter, a linear transformation that filters out this subspace to improve semantic representations and enable dimensionality reduction.

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
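
The observation about projecting embeddings onto the vocabulary space can be illustrated with a small sketch. It assumes the projection is a plain matrix product with the unembedding matrix, in the style of a logit lens; the toy matrix `W_U`, the four-token `vocab`, and the helper `top_tokens` are hypothetical and not taken from the paper or its repository.

```python
import numpy as np

def top_tokens(embedding, W_U, vocab, k=3):
    """Score every vocabulary token against an embedding and return the k best.

    W_U has shape (vocab_size, d); embedding has shape (d,).
    """
    logits = W_U @ embedding              # one score per vocabulary token
    order = np.argsort(logits)[::-1][:k]  # indices of the k largest scores
    return [vocab[i] for i in order]

# Toy data (assumption): the first embedding axis mostly drives "." and "the".
vocab = [".", "the", "cat", "quantum"]
W_U = np.array([
    [2.0, 0.0, 0.0],   # "."
    [1.5, 0.5, 0.0],   # "the"
    [0.0, 1.0, 0.0],   # "cat"
    [0.0, 0.0, 1.0],   # "quantum"
])
embedding = np.array([1.0, 0.2, 0.1])

print(top_tokens(embedding, W_U, vocab, k=2))
```

In this toy setup the two frequent tokens outrank the content words, which mirrors the kind of alignment the abstract describes; a real check would use an actual model's unembedding matrix and pooled sentence embeddings.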

Cached at: 06/08/26, 07:14 AM

Source: https://huggingface.co/papers/2606.07502

Abstract

Text embeddings from large language models are enhanced by EmbedFilter, a linear transformation that reduces the influence of high-frequency tokens and improves semantic representations while enabling dimensionality reduction.

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model’s ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
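
A minimal sketch of the kind of linear filter the abstract describes, under stated assumptions: the subspace to remove is taken to be the span of the top-k right singular vectors of the unembedding matrix, and k is chosen by hand. Neither choice is confirmed by the text above, and the names `build_filter` and `apply_filter` are hypothetical; the authors' repository is the authoritative reference for the actual method.

```python
import numpy as np

def build_filter(W_U, k):
    """Return an orthonormal basis, shape (d, d - k), for the complement of
    the top-k right singular directions of W_U.

    W_U has shape (vocab_size, d) with vocab_size >= d (assumption).
    """
    # Rows of Vt are directions in embedding space, ordered by singular value.
    _, _, Vt = np.linalg.svd(W_U, full_matrices=False)
    return Vt[k:].T

def apply_filter(E, basis):
    """Map embeddings E, shape (n, d), to coordinates in the kept subspace.

    This equals subtracting the projection onto the removed directions and
    then dropping the k coordinates that are now zero, so each output vector
    has d - k dimensions.
    """
    return E @ basis

# Toy data (assumption): random matrices stand in for a real model.
rng = np.random.default_rng(0)
W_U = rng.normal(size=(50, 8))   # vocab_size=50, d=8
E = rng.normal(size=(4, 8))      # four text embeddings

basis = build_filter(W_U, k=2)
reduced = apply_filter(E, basis)
print(reduced.shape)
```

Because the kept basis is orthonormal, inner products between the reduced vectors match those between the full-size embeddings after the projection is subtracted. This is one way a filter of this form could shrink the index without changing similarity scores among the filtered embeddings.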

Similar Articles

@vintcessun: Turns out LLM text embeddings are hijacked by high-frequency tokens (periods, articles)! The unembedding matrix implicitly defines a low-rank subspace dominated by these uninformative expressions. This is the root cause of LLMs' poor performance as universal embeddings, and the contamination is subtle. EmbedFilter…

X AI KOLs Timeline

This study reveals that LLM text embeddings are hijacked by high-frequency tokens (e.g., periods, articles) and proposes EmbedFilter, which performs SVD on the unembedding matrix and subtracts the projection component to release true semantics, achieving zero-training-cost dimensionality reduction and retrieval efficiency gains.

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG

Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.

Your Embedding Model is SMARTer Than You Think

Hugging Face Daily Papers

SMART is a framework that unlocks latent multi-vector capabilities in single-vector models for multimodal retrieval, improving state-of-the-art performance with reduced computational costs via contrastive training and late-interaction inference.