Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings
Summary
The paper finds that LLM text embeddings over-express high-frequency, uninformative tokens and proposes EmbedFilter, a linear transformation that filters out the subspace responsible for this, improving semantic representations and enabling dimensionality reduction.
Cached at: 06/08/26, 07:14 AM
Source: https://huggingface.co/papers/2606.07502
Abstract
Text embeddings from large language models are enhanced by EmbedFilter, a linear transformation that reduces the influence of high-frequency tokens and improves semantic representations while enabling dimensionality reduction.
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model’s ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
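The idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the filtered subspace is spanned by the top-k right singular vectors of the unembedding matrix, that the vocabulary size is at least the hidden size, and that k is a free hyperparameter. The function names (vocab_logits, fit_filter, filter_embeddings, reduce_embeddings) and the random toy data are hypothetical; the paper's actual subspace selection may differ.

```python
import numpy as np

def vocab_logits(embeddings, unembedding):
    # Project embeddings (n, d) onto the vocabulary space via W_U (vocab, d).
    return embeddings @ unembedding.T

def fit_filter(unembedding, k):
    # SVD of W_U; rows of vt are orthonormal directions in embedding space.
    # Assumption: vocab >= d, so vt is a full (d, d) orthogonal matrix.
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    removed = vt[:k].T  # (d, k): subspace assumed to carry frequent tokens
    kept = vt[k:].T     # (d, d - k): orthogonal complement
    return removed, kept

def filter_embeddings(embeddings, removed):
    # Subtract each embedding's projection onto the removed subspace.
    return embeddings - (embeddings @ removed) @ removed.T

def reduce_embeddings(embeddings, kept):
    # Express the filtered embedding in the (d - k)-dimensional kept basis.
    return embeddings @ kept

# Toy data standing in for a real model (hypothetical sizes).
rng = np.random.default_rng(0)
vocab_size, dim, k = 50, 8, 2
W_U = rng.normal(size=(vocab_size, dim))
E = rng.normal(size=(4, dim))

removed, kept = fit_filter(W_U, k)
E_filtered = filter_embeddings(E, removed)
E_reduced = reduce_embeddings(E, kept)
```

Because the kept and removed directions together form an orthonormal basis, inner products between the reduced vectors equal those between the filtered vectors, which is one way to read the claim that the dimensionality reduction preserves the refined embedding quality.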
Similar Articles
@vintcessun: Turns out LLM text embeddings are hijacked by high-frequency tokens (periods, articles)! The unembedding matrix implicitly defines a low-rank subspace dominated by these uninformative expressions. This is the root cause of LLMs' poor performance as universal embeddings, and the contamination is subtle. EmbedFilter…
This study reveals that LLM text embeddings are hijacked by high-frequency tokens (e.g., periods, articles) and proposes EmbedFilter, which performs SVD on the unembedding matrix and subtracts the projection component to release true semantics, achieving zero-training-cost dimensionality reduction and retrieval efficiency gains.
Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.
Your Embedding Model is SMARTer Than You Think
SMART is a framework that unlocks latent multi-vector capabilities in single-vector models for multimodal retrieval, improving state-of-the-art performance with reduced computational costs via contrastive training and late-interaction inference.
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
This paper identifies a shared latent mechanism across diverse backdoor behaviors in LLMs, using sparse autoencoders to detect and causally suppress these features, enabling unified backdoor detection and mitigation across models and attack types.
@mixedbreadai: By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows. But they contain t…
Single-vector embedding models can be used to extract sparse latent terms, and BM25 can turn this vocabulary into a strong retriever.