text-embedding

#text-embedding

@vintcessun: Turns out LLM text embeddings are hijacked by high-frequency tokens (periods, articles)! The unembedding matrix implicitly defines a low-rank subspace dominated by these uninformative expressions. This is the root cause of LLMs' poor performance as universal embeddings, and the contamination is subtle. EmbedFilter…

X AI KOLs Timeline · 14h ago

This study reports that LLM text embeddings are dominated by high-frequency tokens (e.g., periods, articles) and proposes EmbedFilter, which performs SVD on the unembedding matrix and subtracts the embedding's projection onto the resulting low-rank subspace to recover the underlying semantics. The method requires no training and is reported to reduce dimensionality and improve retrieval efficiency.
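The projection-removal step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `filter_embeddings`, the choice of the top-`k` right singular vectors as the subspace to remove, and the value of `k` are all hypothetical, since the summary does not say how EmbedFilter selects the subspace.

```python
import numpy as np

def filter_embeddings(embeddings, unembedding, k):
    """Remove from each embedding its component in a low-rank subspace.

    embeddings:  (n, hidden_dim) array of text embeddings.
    unembedding: (vocab_size, hidden_dim) unembedding (output projection) matrix.
    k:           number of singular directions to remove (an assumed knob).
    """
    # Right singular vectors span directions in hidden space; rows of vt
    # are orthonormal and ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(unembedding, full_matrices=False)
    basis = vt[:k]                                  # (k, hidden_dim)

    # Projection of each embedding onto the subspace spanned by `basis`.
    projection = embeddings @ basis.T @ basis       # (n, hidden_dim)

    # Subtracting it leaves only the component orthogonal to that subspace.
    return embeddings - projection
```

After filtering, every embedding is orthogonal to the removed directions, so cosine similarity between two filtered embeddings no longer depends on how strongly each one loaded on that subspace.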

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

arXiv cs.CL · 2026-06-03

SEA-Embedding presents a fully open and reproducible text embedding pipeline for Southeast Asian languages, trained solely on public data, achieving state-of-the-art results on the SEA-BED benchmark.

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

arXiv cs.CL · 2026-05-27

This paper investigates the cause of cross-lingual retrieval asymmetry in multilingual embedding models. The authors propose and test the hub-mediation hypothesis, finding that hubness, not anisotropy, is the dominant cause, and recommend using CSLS instead of cosine similarity.
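For readers unfamiliar with CSLS (cross-domain similarity local scaling), the sketch below shows the standard formulation: twice the cosine similarity, minus each side's mean similarity to its k nearest neighbours on the other side. The function name `csls_scores` and the default `k=10` are assumptions for illustration; the paper's exact configuration is not given in the summary.

```python
import numpy as np

def csls_scores(queries, docs, k=10):
    """Return an (n_queries, n_docs) matrix of CSLS scores.

    CSLS(q, d) = 2 * cos(q, d) - r(q) - r(d), where r(q) is the mean cosine
    similarity of q to its k nearest documents and r(d) is the mean cosine
    similarity of d to its k nearest queries.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    cos = q @ d.T                                    # (n_queries, n_docs)

    # Clamp k so it never exceeds the number of available neighbours.
    k_q = min(k, cos.shape[1])
    k_d = min(k, cos.shape[0])

    # Sorting ascending and taking the last k entries gives the k largest.
    r_q = np.sort(cos, axis=1)[:, -k_q:].mean(axis=1)   # per query
    r_d = np.sort(cos, axis=0)[-k_d:, :].mean(axis=0)   # per document

    return 2 * cos - r_q[:, None] - r_d[None, :]
```

The `r_d` term is what penalizes hubs: a document that is close to many queries has a high neighbourhood average, which lowers its score against every query.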

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

arXiv cs.CL · 2026-05-22

This paper evaluates four text chunking strategies for Retrieval-Augmented Generation on Khmer agricultural documents, finding that character-based recursive chunking with 300-character chunks yields the best retrieval and relevance performance.
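A character-based recursive chunker can be sketched as follows. This is an illustrative implementation, not the paper's: the function name `recursive_chunks` and the separator list are assumptions (the list includes the Khmer sentence terminator "។" as a plausible choice), and only the 300-character limit comes from the summary.

```python
def recursive_chunks(text, chunk_size=300,
                     separators=("\n\n", "\n", "។", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and falls back to finer ones only
    for pieces that are still too long. Separators at chunk boundaries
    are dropped.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or the empty separator): hard split by length.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size]
                for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging small pieces
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(piece) > chunk_size:
            # Piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Because counting is done in characters rather than tokens or words, the approach does not depend on a word segmenter, which may matter for a script like Khmer that does not put spaces between words.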
