SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges

Hugging Face Daily Papers 05/25/26, 12:00 AM Papers

cross-lingual sparse-encoders retrieval multilingual semantic-alignment fine-tuning zero-shot

Summary

SemBridge is a novel embedding initialization method that leverages multilingual bridge models to establish semantic alignments between source and target vocabularies, improving cross-lingual sparse encoder adaptation and retrieval performance across multiple languages.

Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric structures pose a critical impediment to language transfer for non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method designed for cross-lingual adaptation in sparse encoders by leveraging multilingual bridge models. SemBridge establishes semantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than directly relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token, effectively filtering out semantic noise and reconstructing target tokens as precise linear combinations of core synonyms. This accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures demonstrate that SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning compared to existing baselines. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
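The initialization idea described above can be illustrated with a minimal sketch. The abstract does not specify how source tokens are selected or how the combination weights are computed, so the top-k cosine selection, the softmax weighting, the `temperature` parameter, and all function and variable names below are assumptions made for illustration, not the authors' implementation. The toy vectors are invented and written as plain Python lists to keep the example self-contained.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def init_target_embedding(target_bridge_vec, source_bridge_vecs,
                          source_embeddings, k=2, temperature=0.1):
    """Initialize one target-token embedding from its k nearest source tokens.

    target_bridge_vec:  the target token's vector in the multilingual bridge space
    source_bridge_vecs: bridge-space vectors of all source tokens
    source_embeddings:  the sparse encoder's existing source-token embeddings
    k, temperature:     hypothetical knobs, not taken from the paper
    """
    # Similarity of the target token to every source token in the bridge space.
    sims = [cosine(target_bridge_vec, s) for s in source_bridge_vecs]
    # Keep only the k most similar source tokens; the rest are ignored.
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Softmax over the selected similarities (assumed weighting scheme).
    peak = max(sims[i] for i in top)
    exps = [math.exp((sims[i] - peak) / temperature) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted linear combination of the selected source embeddings.
    dim = len(source_embeddings[0])
    vec = [sum(w * source_embeddings[i][d] for w, i in zip(weights, top))
           for d in range(dim)]
    return vec, dict(zip(top, weights))

# Toy data: three source tokens ("dog", "puppy", "car") and one target token.
source_bridge_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
source_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
target_bridge_vec = [0.95, 0.05]  # e.g. a target-language word for "dog"

vec, weights = init_target_embedding(
    target_bridge_vec, source_bridge_vecs, source_embeddings, k=2)
print(weights)  # indices of the selected source tokens and their weights
print(vec)      # initialized target embedding
```

In this toy setup the unrelated source token ("car") falls outside the top-k set and contributes nothing to the initialized vector, which is the noise-filtering effect the abstract attributes to restricting the combination to a small set of related tokens.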

Cached at: 05/26/26, 06:43 AM

Source: https://huggingface.co/papers/2605.26002

Abstract

SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages.

Similar Articles

Why Advanced Encoders Lag on Sparse Retrieval? The Answer and an Approach to Bridging Vocabulary Gaps

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Multilingual Sentence Embeddings for Linguistic-Integrated Reliability Audit
