Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
Summary
This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built via cross-lingual tokenizer surgery and offline distillation. The 200M-parameter model outperforms its 300M-parameter teacher on STSbTR and ranks 7th among 26 models on TR-MTEB, giving it a strong cost-quality trade-off.
Source: https://huggingface.co/papers/2605.29992
This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built through cross-lingual tokenizer surgery and offline embedding distillation. Instead of expensive full pretraining, we adapt a multilingual embedding model by constructing a Turkish-optimized 131k vocabulary tokenizer, cloning the teacher architecture with a compatible embedding table, and distilling from precomputed teacher vectors.
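The summary does not say how the compatible embedding table is filled in after the tokenizer is swapped. One common heuristic, shown here only as an illustrative sketch and not as the authors' confirmed procedure, is to copy vectors for tokens shared by both vocabularies and to initialize each new token as the mean of the old-tokenizer pieces it splits into. The names `init_new_embeddings` and `toy_old_tokenize`, and the toy vocabulary, are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_new_embeddings(
    new_vocab: List[str],
    old_tokenize: Callable[[str], List[str]],
    old_embeddings: Dict[str, Vector],
    dim: int,
) -> Dict[str, Vector]:
    """Build an embedding table for new_vocab from an existing table.

    Assumed heuristic (not taken from the paper): shared tokens are copied,
    new tokens get the mean of their old-tokenizer pieces.
    """
    table: Dict[str, Vector] = {}
    for token in new_vocab:
        if token in old_embeddings:
            # Token exists in both vocabularies: copy its vector unchanged.
            table[token] = list(old_embeddings[token])
            continue
        # Otherwise split the token with the old tokenizer and keep known pieces.
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            # No usable pieces: fall back to a zero vector.
            table[token] = [0.0] * dim
            continue
        # Average the piece vectors component by component.
        table[token] = [
            sum(old_embeddings[p][i] for p in pieces) / len(pieces)
            for i in range(dim)
        ]
    return table


def toy_old_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the old tokenizer: fixed two-character chunks."""
    return [text[i:i + 2] for i in range(0, len(text), 2)]


old_table = {"ev": [1.0, 0.0], "le": [0.0, 1.0], "r": [1.0, 1.0]}
new_table = init_new_embeddings(
    ["ev", "evler", "xyz"], toy_old_tokenize, old_table, dim=2
)
```

In this toy run, "ev" is copied as is, "evler" receives the mean of the vectors for "ev", "le", and "r", and "xyz" falls back to zeros because none of its pieces are known. A real pipeline would operate on the model's embedding matrix rather than on Python dictionaries.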
The resulting 200M-parameter model supports an 8,192-token context window and achieves 77.55% Pearson / 77.45% Spearman on STSbTR, outperforming the 300M-parameter teacher model. On TR-MTEB, it reaches a 63.9% mean score, ranking 7th among 26 models while offering a strong cost-quality trade-off.
All artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling. The work is relevant for Turkish NLP, low-resource language adaptation, sentence embeddings, semantic search, RAG, tokenizer optimization, and efficient distillation.
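Distilling from precomputed teacher vectors means the teacher never runs during training: its embeddings are computed once, stored, and looked up per example. The sketch below illustrates that idea with a cosine-distance objective. The loss choice, the function names, and the toy data are assumptions made for illustration; the summary does not state which objective the authors used.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def distillation_loss(
    student_batch: Dict[str, Sequence[float]],
    teacher_cache: Dict[str, Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) between student and cached teacher vectors.

    teacher_cache stands in for a precomputed embedding dataset keyed by
    example id, so no teacher forward pass is needed here.
    """
    losses = [
        1.0 - cosine_similarity(vector, teacher_cache[example_id])
        for example_id, vector in student_batch.items()
    ]
    return sum(losses) / len(losses)


# Hypothetical cached teacher vectors and one batch of student outputs.
teacher_cache = {"s1": [1.0, 0.0], "s2": [0.0, 2.0]}
student_batch = {"s1": [2.0, 0.0], "s2": [1.0, 0.0]}
loss = distillation_loss(student_batch, teacher_cache)
```

Here the student matches the teacher's direction on "s1" (loss 0) and is orthogonal to it on "s2" (loss 1), so the batch loss is 0.5. In practice the student outputs would come from a trainable model and the loss would be minimized with a gradient-based optimizer.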
Similar Articles
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.
beautyyuyanli/multilingual-e5-large
Multilingual E5-large embedding model is now available on Replicate, costing ~$0.00098 per run and completing in ~1 second on Nvidia L40S.
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
Introduces MulTaBench, a benchmark of 40 datasets for multimodal tabular learning with text and image modalities, demonstrating that task-specific embedding tuning improves performance over frozen pretrained embeddings, particularly when modalities provide complementary predictive signals.
SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges
SemBridge is a novel embedding initialization method that leverages multilingual bridge models to establish semantic alignments between source and target vocabularies, improving cross-lingual sparse encoder adaptation and retrieval performance across multiple languages.
Discovering Lexical Gaps Using Embeddings from Multilingual LLMs
This paper proposes a data-driven framework using embeddings from multilingual LLMs to detect lexical gaps between languages, achieving high accuracy in Korean-English pairs.