SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges
Summary
SemBridge is a novel embedding initialization method that leverages multilingual bridge models to establish semantic alignments between source and target vocabularies, improving cross-lingual sparse encoder adaptation and retrieval performance across multiple languages.
View Cached Full Text
Cached at: 05/26/26, 06:43 AM
Paper page - SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges
Source: https://huggingface.co/papers/2605.26002
Abstract
SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages.
Sparse encodersoffer high-precision retrieval by representing term importance within a vocabulary space, yet their English-centric structures pose a critical impediment to language transfer for non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method designed forcross-lingual adaptationinsparse encodersby leveragingmultilingual bridge models. SemBridge establishessemantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than directly relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token, effectively filtering out semantic noise and reconstructing target tokens as precise linear combinations of core synonyms. This accelerates convergence duringfine-tuningand improves training efficiency. Extensive experiments across five languages and four sparse architectures demonstrate that SemBridge achieves superiorzero-shot retrievalperformance and consistently improvesretrieval performanceafterfine-tuningcompared to existing baselines. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
View arXiv pageView PDFAdd to collection
Get this paper in your agent:
hf papers read 2605\.26002
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.26002 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.26002 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.26002 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
LLMBridge: An LLM Pipeline for End-to-end Referential Bridging Resolution in English
LLMBridge introduces an LLM-based pipeline for end-to-end referential bridging resolution, achieving state-of-the-art performance on three English datasets. The system combines heuristic pre/post-processing with LLM natural language inference.
Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
This paper introduces a principled approach to multilingual language steering using sparse autoencoders (SAEs) trained on multilingual data and a novel layer selection rule based on the intersection of multilingual alignment and language separability, evaluated on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization.
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
This paper proposes using reinforcement learning with semantic rewards (via GRPO) to expand LLMs to low-resource languages without the typical alignment tax of catastrophic forgetting, showing improved semantic quality and transferability over supervised fine-tuning.
Discovering Lexical Gaps Using Embeddings from Multilingual LLMs
This paper proposes a data-driven framework using embeddings from multilingual LLMs to detect lexical gaps between languages, achieving high accuracy in Korean-English pairs.
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
This paper uses sparse autoencoders to decompose LLMs into interpretable features and shows that semantic features explain brain alignment with cortical semantic topography, generalizing across English, Chinese, and French.