Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

turkish-nlp low-resource sentence-embeddings semantic-search rag tokenizer-optimization distillation

Summary

This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built via cross-lingual tokenizer surgery and offline distillation. With roughly 200M parameters, it reaches 77.55%/77.45% Pearson/Spearman on STSbTR and a 63.9% mean score on TR-MTEB.

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and, because it avoids online teacher inference during training, trains in roughly four hours on a single GPU at a total cost of $5-20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
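Stage (2) of the pipeline can be illustrated with a minimal sketch. It assumes that "mean-composition token mapping" means: split each new-vocabulary token into teacher tokens and average the corresponding teacher embedding rows. The toy greedy tokenizer, the vocabularies, and the function names are hypothetical stand-ins, not the released tooling, and a real implementation would operate on tensors rather than Python lists.

```python
from typing import Dict, List


def greedy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Toy longest-match tokenizer standing in for the teacher tokenizer (assumption)."""
    ids, i = [], 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return ids


def init_embedding_table(
    new_vocab: List[str],
    teacher_vocab: Dict[str, int],
    teacher_emb: List[List[float]],
) -> List[List[float]]:
    """Build one row per new token as the mean of its teacher-token rows."""
    table = []
    for token in new_vocab:
        rows = [teacher_emb[i] for i in greedy_tokenize(token, teacher_vocab)]
        dim = len(rows[0])
        # A token already in the teacher vocabulary maps to a single row,
        # so its mean is simply a copy of that row.
        table.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return table


# Hypothetical 2-dimensional teacher embeddings for three subword pieces.
teacher_vocab = {"ev": 0, "ler": 1, "de": 2}
teacher_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# The new vocabulary adds whole-word Turkish tokens built from those pieces.
new_vocab = ["ev", "evler", "evlerde"]
student_emb = init_embedding_table(new_vocab, teacher_vocab, teacher_emb)
```

The appeal of this kind of initialization is that the new embedding rows start close to what the preserved transformer backbone already expects, so distillation only has to refine them rather than learn them from scratch.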

Original Article

Cached at: 06/02/26, 03:36 PM

Source: https://huggingface.co/papers/2605.29992

This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built through cross-lingual tokenizer surgery and offline embedding distillation. Instead of expensive full pretraining, we adapt a multilingual embedding model by constructing a Turkish-optimized 131k vocabulary tokenizer, cloning the teacher architecture with a compatible embedding table, and distilling from precomputed teacher vectors.
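Distilling from precomputed teacher vectors can be sketched as follows. The summary only states that a cosine similarity objective is used; the specific form here, the mean of 1 - cos(student, teacher) over a batch, is an assumption, as are the cache layout and all names. The point of the sketch is that the teacher is never called during training: its outputs are looked up from a cache computed once beforehand.

```python
import math
from typing import Dict, List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distillation_loss(
    student_batch: List[List[float]], teacher_batch: List[List[float]]
) -> float:
    """Mean of (1 - cosine similarity) over the batch (assumed loss form)."""
    losses = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_batch, teacher_batch)
    ]
    return sum(losses) / len(losses)


# Hypothetical cache of teacher vectors, computed once before training
# and keyed by example id.
teacher_cache: Dict[str, List[float]] = {
    "doc-0": [1.0, 0.0],
    "doc-1": [0.0, 2.0],
}

# Hypothetical student outputs for the same examples at some training step.
student_out: Dict[str, List[float]] = {
    "doc-0": [3.0, 0.0],  # same direction as the teacher vector
    "doc-1": [1.0, 0.0],  # orthogonal to the teacher vector
}

batch_ids = ["doc-0", "doc-1"]
loss = distillation_loss(
    [student_out[i] for i in batch_ids],
    [teacher_cache[i] for i in batch_ids],
)
```

Because the loss depends only on direction, a student vector that points the same way as the teacher's contributes zero loss regardless of its length, which fits a model whose outputs are L2-normalized anyway.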

The resulting 200M-parameter model supports an 8,192-token context window and achieves 77.55% Pearson / 77.45% Spearman on STSbTR, outperforming the 300M-parameter teacher model. On TR-MTEB, it reaches a 63.9% mean score, ranking 7th among 26 models while offering a strong cost-quality trade-off.
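For readers unfamiliar with the STS metrics quoted above, the sketch below shows how Pearson and Spearman correlations are typically computed between a model's predicted similarities and gold similarity scores. The scores in the example are made up for illustration and are not from STSbTR; the percentages in the text are presumably these correlations multiplied by 100.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: cosine similarities from a model vs. gold 0-5 scores.
predicted = [0.9, 0.1, 0.5, 0.7]
gold = [5.0, 0.0, 2.0, 4.5]

pearson_score = pearson(predicted, gold)
spearman_score = spearman(predicted, gold)
```

In this toy example the predicted and gold scores are ordered identically, so the Spearman correlation is at its maximum, while the Pearson correlation is slightly lower because the relationship is not perfectly linear.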

All artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling. The work is relevant for Turkish NLP, low-resource language adaptation, sentence embeddings, semantic search, RAG, tokenizer optimization, and efficient distillation.

Similar Articles

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

@liquidai: Introducing LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M: two multilingual retrieval models built for ultra-fast and a…

beautyyuyanli/multilingual-e5-large
Multilingual E5-large embedding model is now available on Replicate, costing ~$0.00098 per run and completing in ~1 second on Nvidia L40S.

Multilingual Sentence Embeddings for Linguistic-Integrated Reliability Audit