Tag
This study shows that LLM text embeddings are hijacked by high-frequency tokens (e.g., periods, articles) and proposes EmbedFilter, which performs SVD on the unembedding matrix and subtracts the corresponding projection component from the embeddings to recover their underlying semantics, enabling training-free dimensionality reduction and more efficient retrieval.
SEA-Embedding presents a fully open and reproducible text embedding pipeline for Southeast Asian languages, trained solely on public data, achieving state-of-the-art results on the SEA-BED benchmark.
This paper investigates the cause of cross-lingual retrieval asymmetry in multilingual embedding models. The authors propose and test the hub-mediation hypothesis, finding that hubness, not anisotropy, is the dominant cause, and recommend using CSLS instead of cosine similarity.
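To make the CSLS recommendation concrete, here is a minimal pure-Python sketch of the standard CSLS (cross-domain similarity local scaling) score: twice the cosine similarity minus each side's mean similarity to its k nearest neighbours on the other side. The function names `cosine` and `csls_scores`, the default `k`, and the toy vectors are illustrative assumptions, not the paper's code or settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_top_k(values, k):
    """Mean of the k largest values (or of all values if fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def csls_scores(queries, docs, k=2):
    """Return a matrix with CSLS(q, d) = 2*cos(q, d) - r(q) - r(d).

    r(q) is the mean cosine of q to its k nearest documents, and
    r(d) is the mean cosine of d to its k nearest queries. A hub
    document (close to many queries) has a large r(d), so it is penalised.
    """
    cos = [[cosine(q, d) for d in docs] for q in queries]
    r_q = [mean_top_k(row, k) for row in cos]
    r_d = [mean_top_k(col, k) for col in zip(*cos)]
    return [
        [2 * cos[i][j] - r_q[i] - r_d[j] for j in range(len(docs))]
        for i in range(len(queries))
    ]

# Toy usage (hypothetical 2-D embeddings): rank documents per query by CSLS.
queries = [[1.0, 0.0], [0.0, 1.0]]
docs = [[1.0, 0.0], [0.0, 1.0]]
scores = csls_scores(queries, docs, k=2)
best = [max(range(len(docs)), key=lambda j: row[j]) for row in scores]
```

Ranking by `scores` instead of raw cosine leaves the geometry untouched and only rescales similarities by local neighbourhood density, which is why it can be swapped in without retraining the embedding model.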
This paper evaluates four text chunking strategies for Retrieval-Augmented Generation on Khmer agricultural documents, finding that character-based Recursive chunking with a chunk size of 300 characters yields the best retrieval and relevance scores.
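As a rough illustration of character-based recursive chunking, the sketch below splits text on progressively finer separators until every piece fits within the character budget. The function name `recursive_chunks`, the separator order, and the lack of chunk overlap are assumptions made for this example; the paper's actual implementation and settings (beyond the 300-character size) are not specified here.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters.

    Coarser separators are tried first; a piece that is still too long
    is split again with the next, finer separator. An empty separator
    means a hard cut every chunk_size characters.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate          # keep packing parts into this chunk
            continue
        if current:
            chunks.append(current)       # flush the chunk built so far
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunks(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

# Toy usage with a small budget so the splitting is visible.
sample = "one two three four five"
pieces = recursive_chunks(sample, chunk_size=9)
```

Measuring the budget in characters rather than tokens or words keeps the splitter independent of any tokenizer or word segmenter.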