Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

Hugging Face Daily Papers

Summary

This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built via cross-lingual tokenizer surgery and offline distillation. It achieves strong results on Turkish benchmarks at a competitive cost-quality trade-off.

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of 5-20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
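
Stage (2) can be illustrated with a minimal sketch. It assumes that "mean-composition token mapping" means each token of the new vocabulary is split into pieces known to the teacher's vocabulary and initialized with the mean of those pieces' embedding rows. The toy vocabulary, the greedy segment helper, and the 2-dimensional vectors are hypothetical stand-ins for illustration and are not taken from the released tooling.

```python
from typing import Dict, List

Vector = List[float]


def segment(token: str, old_vocab: Dict[str, int]) -> List[int]:
    """Greedy longest-match split of a new token into old-vocabulary ids.

    Hypothetical stand-in for running the teacher tokenizer on the token string.
    """
    ids: List[int] = []
    start = 0
    while start < len(token):
        # Try the longest remaining piece first, then shrink.
        for end in range(len(token), start, -1):
            piece = token[start:end]
            if piece in old_vocab:
                ids.append(old_vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"cannot segment {token!r} at position {start}")
    return ids


def mean_compose(ids: List[int], old_table: List[Vector]) -> Vector:
    """Average the old embedding rows selected by ids, dimension by dimension."""
    dim = len(old_table[0])
    return [sum(old_table[i][d] for i in ids) / len(ids) for d in range(dim)]


def init_new_table(
    new_vocab: List[str], old_vocab: Dict[str, int], old_table: List[Vector]
) -> Dict[str, Vector]:
    """Build one initial embedding row per token of the new vocabulary."""
    return {
        tok: mean_compose(segment(tok, old_vocab), old_table) for tok in new_vocab
    }


# Toy teacher vocabulary and embedding table (assumed values).
old_vocab = {"kitap": 0, "lar": 1, "ev": 2}
old_table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# "kitaplar" is new and is composed from "kitap" + "lar"; "ev" maps one-to-one.
new_table = init_new_table(["kitaplar", "ev"], old_vocab, old_table)
print(new_table)
```

Tokens that already exist in the teacher's vocabulary keep their original row under this scheme, so only genuinely new tokens start from a composed estimate while the transformer backbone is left untouched.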

Source: https://huggingface.co/papers/2605.29992

This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built through cross-lingual tokenizer surgery and offline embedding distillation. Instead of expensive full pretraining, we adapt a multilingual embedding model by constructing a Turkish-optimized 131k vocabulary tokenizer, cloning the teacher architecture with a compatible embedding table, and distilling from precomputed teacher vectors.
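
The offline distillation step can be sketched as follows, assuming the cosine similarity objective is one minus the cosine similarity between student and teacher vectors, averaged over a batch. The sentences, the cached teacher vectors, and the student outputs are hypothetical values chosen for illustration. The point of the sketch is that teacher vectors are read from a precomputed store, so no teacher forward pass happens during training.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distillation_loss(student: List[Vector], teacher: List[Vector]) -> float:
    """Mean of (1 - cosine similarity) over paired student/teacher vectors."""
    total = 0.0
    for s, t in zip(student, teacher):
        s_hat, t_hat = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s_hat, t_hat))
    return total / len(student)


# Teacher vectors are computed once, stored, and only looked up during training.
teacher_cache: Dict[str, List[float]] = {
    "merhaba dünya": [0.6, 0.8],
    "iyi günler": [1.0, 0.0],
}

# Hypothetical student outputs for the same sentences at some training step.
student_out: Dict[str, List[float]] = {
    "merhaba dünya": [3.0, 4.0],  # same direction as the teacher vector
    "iyi günler": [0.0, 2.0],     # orthogonal to the teacher vector
}

batch = list(teacher_cache)
loss = cosine_distillation_loss(
    [student_out[s] for s in batch],
    [teacher_cache[s] for s in batch],
)
print(f"batch loss: {loss:.3f}")
```

Because the loss depends only on direction, a student output that points the same way as the teacher vector contributes nothing regardless of its magnitude, which fits a model whose final embeddings are L2-normalized.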

The resulting 200M-parameter model supports an 8,192-token context window and achieves 77.55% Pearson / 77.45% Spearman on STSbTR, outperforming the 300M-parameter teacher model. On TR-MTEB, it reaches a 63.9% mean score, ranking 7th among 26 models while offering a strong cost-quality trade-off.
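
For readers unfamiliar with the two STS metrics, the sketch below computes Pearson and Spearman correlation between gold similarity ratings and model similarity scores. The gold and predicted values are hypothetical, and the helper functions are a plain-Python illustration rather than the evaluation code used in the paper. Benchmarks typically report these coefficients multiplied by 100.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


gold = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical human similarity ratings
pred = [0.1, 0.2, 0.4, 0.8, 1.6]  # hypothetical model cosine similarities

print(f"Pearson:  {100 * pearson(gold, pred):.2f}%")
print(f"Spearman: {100 * spearman(gold, pred):.2f}%")
```

In this toy case the predictions preserve the gold ordering exactly but are not linearly related to it, so Spearman is higher than Pearson. Reporting both, as the paper does, separates ranking quality from linear agreement.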

All artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling. The work is relevant for Turkish NLP, low-resource language adaptation, sentence embeddings, semantic search, RAG, tokenizer optimization, and efficient distillation.
