Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

arXiv cs.CL 06/18/26, 04:00 AM Papers
Summary
This paper presents Morpheus, a neural tokenizer and word embedder for Turkish that learns morpheme boundaries without string normalization, achieving lossless tokenization and competitive embeddings for lexical retrieval, while using less GPU memory than subword tokenizers.
arXiv:2606.18717v1 Announce Type: new Abstract: Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents \textbf{Morpheus}, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character ($1.425$), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 $0.61$ vs.\ ${\sim}0.32$), and uses ${\sim}19\%$ less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP $0.85$) and same-root verification (ROC-AUC $1.00$), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
# Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Source: [https://arxiv.org/html/2606.18717](https://arxiv.org/html/2606.18717)
###### Abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and (in the case of WordPiece and rule-based analyzers) failing to decode their output back to the original text. This paper presents **Morpheus**, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson–binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so $\mathrm{decode}(\mathrm{encode}(w))=w$ holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character ($1.425$), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 $0.61$ vs. ${\sim}0.32$), and uses ${\sim}19\%$ less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP $0.85$) and same-root verification (ROC-AUC $1.00$), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus’s root-centric geometry. Code: [https://github.com/lonewolf-rd/TurkishMorpheus](https://github.com/lonewolf-rd/TurkishMorpheus); model: [https://huggingface.co/lonewolflab/Morpheus-TR-50K](https://huggingface.co/lonewolflab/Morpheus-TR-50K); interactive demo: [https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo](https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo).


Şakar, Tolga (Independent Researcher), lonewolf_rd@protonmail.com

## 1 Introduction

Turkish is an agglutinative language that encodes most of its semantic content in productive chains of derivational and inflectional suffixes attached to a root; a single root can unfold into hundreds of distinct surface forms through the ordering of its morphemes (e.g. *ev* “house” $\rightarrow$ *evlerimizdekiler* “the ones in our houses”). The unit that carries meaning in Turkish is therefore the morpheme, not the word and not a frequency-driven fragment of it. This property places two distinct demands on the machinery of modern Turkish NLP, one on *tokenization* and one on *word representation*, and, as argued below, current tools meet each of them only partially.

#### The tokenization problem\.

Subword tokenizers such as BPE, WordPiece, and Unigram (Sennrich et al., [2016](https://arxiv.org/html/2606.18717#bib.bib16); Kudo and Richardson, [2018](https://arxiv.org/html/2606.18717#bib.bib14)) segment words by corpus statistics rather than morphology, and on Turkish this produces two concrete failures. First, several widely used tokenizers are *not reversible*: decoding the ids back to text does not recover the original string. WordPiece strips Turkish diacritics (ç, ğ, ı, ö, ş, ü) and the rule-based TurkishTokenizer applies canonical re-harmonization, so a non-trivial fraction of inflected words cannot be reconstructed. In a generative LLM, where every generated token id must decode to faithful text, this loss directly corrupts model output and silently degrades any task that reads the decoded string. Second, because semantically loaded suffixes are cut at arbitrary positions, words are over-fragmented: more tokens are emitted per word (higher fertility), which inflates sequence length, compute, and memory at both training and inference time. Unsupervised morphological segmenters such as Morfessor (Creutz and Lagus, [2007](https://arxiv.org/html/2606.18717#bib.bib9)) and rule-based analyzers such as Zemberek (Akın and Akın, [2007](https://arxiv.org/html/2606.18717#bib.bib1)) address the morphological-alignment side, but the former is not optimized for language modeling and the latter is lossy and dictionary-bound. In short, existing tokenizers each answer part of the problem (reversibility, morphological alignment, or low fertility), but none answers all three at once.

#### The representation problem\.

The same morphological richness also strains Turkish word representation. Contextual encoders such as BERTurk (Schweter, [2020](https://arxiv.org/html/2606.18717#bib.bib15)) provide strong embeddings, but they are heavyweight ($\sim$110M+ parameters), tied to their own lossy subword vocabularies, and treat morphology only implicitly. A representation in which morphologically related forms (*kitap*, *kitaplar*, *kitabımız*) sit together by construction, rather than only after large-scale pretraining, remains absent. More fundamentally, tokenization and representation are currently solved by *two separate systems*: a tokenizer produces discrete ids that carry no meaning, and a distinct, much larger model must be trained to turn those ids into vectors. For an agglutinative language, where the boundary information needed to tokenize well and the structure needed to represent well are one and the same morphological signal, this separation is wasteful.

#### This paper\.

Taken together, these gaps motivate a single Turkish model that is simultaneously a *lossless, morphology-aware tokenizer* and a *structured word-embedding producer*. This paper aims to provide exactly that, and introduces **Morpheus**, a neural morpheme-boundary model for Turkish. Morpheus combines boundary supervision from an unsupervised analyzer (Morfessor) with self-supervised objectives (skip-gram negative sampling, root-family contrastive learning, and masked language modeling), and segments words through a differentiable Poisson–binomial dynamic program: gradients flow over soft morpheme memberships during training, while inference recovers exact hard boundaries with no architectural switch and no string normalization. Because no normalization is applied, the emitted pieces *are* the surface form, so $\mathrm{decode}(\mathrm{encode}(w))=w$ holds by construction. And because the model is neural, the same forward pass that tokenizes also yields, as a by-product, a structured $\mathbb{R}^{320}$ embedding per word, making Morpheus a tokenizer and a word-embedding model at once.

The contributions of this paper are:

- **Morpheus**, a neural morphology-aware tokenizer for Turkish that is lossless without inference-time normalization, via a differentiable Poisson–binomial soft segmentation that unifies training and inference.
- A demonstration that the *same model* is a word-embedding producer, evaluated against contextual encoders (BERTurk) and a strong multilingual retriever (BGE-M3) on root-family retrieval, lexical dedup, morphological probing, and Turkish NER, characterizing where a morphology-derived embedding helps and where it does not.
- A comprehensive evaluation suite (reversibility, MorphScore, SIGMORPHON, surface fidelity, and language-modeling BPC) that cleanly establishes the lossless-vs-lossy distinction against the subword family and existing Turkish tokenizers.

## 2 Related Work

#### Subword tokenization and its limits for Turkish\.

BPE (Sennrich et al., [2016](https://arxiv.org/html/2606.18717#bib.bib16)), WordPiece (Devlin et al., [2019](https://arxiv.org/html/2606.18717#bib.bib10)), and Unigram (Kudo, [2018](https://arxiv.org/html/2606.18717#bib.bib13)), implemented at scale through SentencePiece (Kudo and Richardson, [2018](https://arxiv.org/html/2606.18717#bib.bib14)), are the de facto interface between text and modern language models. A growing body of work shows that this frequency-driven design is not neutral for morphologically rich languages such as Turkish. Toraman et al. ([2023](https://arxiv.org/html/2606.18717#bib.bib18)) compare five tokenizers at different granularities and find that a morphological-level tokenizer is competitive with the de facto ones while responding more strongly to vocabulary size, and that the ratio of vocabulary to model parameters is itself a design variable. Kaya and Tantuğ ([2024](https://arxiv.org/html/2606.18717#bib.bib12)) study vocabulary size for Turkish BERT models across NER, sentiment, and QA, and Altinok ([2026](https://arxiv.org/html/2606.18717#bib.bib2)) presents a systematic evaluation of the data–vocabulary–morphology interplay under matched parameter budgets, together with morphology-aware diagnostics (boundary F1, lemma atomicity, over-/under-segmentation). These studies quantify the cost of frequency-driven segmentation; Morpheus instead attacks it at the source, by learning morpheme boundaries with a neural model.

#### Morphology\-aware and linguistically informed tokenizers\.

The unsupervised Morfessor family (Creutz and Lagus, [2002](https://arxiv.org/html/2606.18717#bib.bib8), [2007](https://arxiv.org/html/2606.18717#bib.bib9)) induces morpheme-like units via a minimum-description-length objective and remains a standard segmentation baseline for agglutinative languages; we use it as the boundary teacher for Morpheus. Rule-based analyzers such as Zemberek (Akın and Akın, [2007](https://arxiv.org/html/2606.18717#bib.bib1)) encode Turkish morphology explicitly but are dictionary-bound. More recent Turkish-specific tokenizers improve linguistic alignment in different ways: Bayram et al. ([2025a](https://arxiv.org/html/2606.18717#bib.bib3)) propose a hybrid tokenizer (TurkishTokenizer) that combines dictionary-driven root/affix segmentation, phonological normalization mapping allomorphic variants to shared identifiers, and a subword fallback, reporting strong Turkish-token and purity rates and competitive STS and TurBLiMP results; Gulgonul ([2025](https://arxiv.org/html/2606.18717#bib.bib11)) exploits the closed syllable inventory of Turkish for a resource-light, retrieval-oriented tokenizer. These methods raise morphological alignment, but they do so through runtime normalization (which discards surface information, e.g. mapping allomorphs to a canonical id) or through fixed dictionaries and syllable inventories. Morpheus differs on two axes: it learns boundaries neurally rather than from a lexicon, and it applies no normalization, so segmentation is surface-preserving and exactly invertible. Uniquely, the same model also yields word embeddings.

#### Evaluation standards for Turkish tokenization\.

Bayram et al. ([2025b](https://arxiv.org/html/2606.18717#bib.bib4)) and its conference counterpart (Bayram et al., [2025c](https://arxiv.org/html/2606.18717#bib.bib5)) introduce the TR-MMLU benchmark and the Turkish-token (%TR) and pure-token (%Pure) metrics, arguing that linguistic alignment of tokens correlates with downstream performance more strongly than raw token purity. We adopt the %TR/%Pure protocol for vocabulary-level comparison and complement it with metrics that prior comparisons largely omit: exact reversibility, gold morpheme F1 (MorphScore), SIGMORPHON inflection alignment, surface-string fidelity, and bits-per-character under a parameter-equalized language-model budget. Together these make explicit the lossless-versus-lossy axis that, as we show, separates tokenizers that are valid for generation from those that are not.

#### Turkish word representations and the tokenizer–embedding gap\.

On the representation side, BERTurk (Schweter, [2020](https://arxiv.org/html/2606.18717#bib.bib15)) provides strong contextual Turkish embeddings, and recent work adapts multilingual encoders to Turkish: for example, Bayram et al. ([2026](https://arxiv.org/html/2606.18717#bib.bib6)) perform cross-lingual tokenizer surgery and offline distillation to build a Turkish sentence-embedding model, while general multilingual retrievers such as BGE-M3 (Chen et al., [2024](https://arxiv.org/html/2606.18717#bib.bib7)) are competitive on Turkish out of the box. All of these treat representation as a system separate from, and much larger than, the tokenizer. Morpheus instead couples the two: a single neural model both tokenizes losslessly and emits a morphology-derived embedding, and we evaluate that embedding directly against BERTurk and BGE-M3.

## 3 Methodology

### 3.1 Data and preprocessing

Morpheus is trained on a large-scale monolingual Turkish corpus that combines a multi-register author corpus with the full cleaned Turkish Wikipedia ($\sim$10 GB of raw text), assembled to expose the model to diverse morphological constructions across four registers: Ekşisözlük (informal/colloquial, rich in spoken-language suffixation), Dergipark (academic, derivational morphology and terminology), Turkish news sites (standard journalistic), and Turkish Wikipedia (encyclopedic, broad vocabulary). The web-sourced registers were collected and cleaned with a companion scraping toolkit that documents per-source extraction, HTML/URL stripping, Unicode normalization, and deduplication; the Wikipedia portion is additionally filtered for Turkish-alphabet coverage, stopword/length thresholds, and markup, then deduplicated. All text is processed with Turkish-aware case folding (İ $\rightarrow$ i, I $\rightarrow$ ı), with the original casing retained as a per-character side channel rather than discarded.
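The case-folding step can be illustrated with a minimal sketch. It assumes the side channel is a list of 0/1 flags, one per character; the function names `fold_with_case_channel` and `restore_case` are hypothetical and are not taken from the released code. The two Turkish capitals are special-cased because the default Unicode lowercasing of `I` yields `i` rather than `ı`.

```python
def fold_with_case_channel(text: str) -> tuple[str, list[int]]:
    """Lowercase with Turkish rules; return folded text and per-character case flags."""
    folded, flags = [], []
    for ch in text:
        if ch == "İ":                      # dotted capital -> dotted small i
            folded.append("i")
            flags.append(1)
        elif ch == "I":                    # dotless capital -> dotless small ı
            folded.append("ı")
            flags.append(1)
        elif ch.isupper() and len(ch.lower()) == 1:
            folded.append(ch.lower())
            flags.append(1)
        else:                              # already lowercase, digit, punctuation, ...
            folded.append(ch)
            flags.append(0)
    return "".join(folded), flags


def restore_case(folded: str, flags: list[int]) -> str:
    """Invert fold_with_case_channel using the stored flags."""
    out = []
    for ch, flag in zip(folded, flags):
        if not flag:
            out.append(ch)
        elif ch == "i":
            out.append("İ")
        elif ch == "ı":
            out.append("I")
        else:
            out.append(ch.upper())
    return "".join(out)


folded, flags = fold_with_case_channel("İstanbul ve Iğdır")
print(folded)                       # lowercased text with Turkish i/ı handling
print(restore_case(folded, flags))  # original casing recovered from the flags
```

Keeping the flags alongside the folded text is what lets the folding stay invertible for the Turkish alphabet; other scripts with one-to-many case mappings would need extra handling that this sketch does not attempt.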

### 3.2 Caching, supervision, and splits

The corpus is split $95/5$ into train and test partitions with a fixed seed. To remove per-epoch segmentation overhead, each sentence is pre-tokenized once into a cached tensor bundle containing, per word: character ids (padded to $\text{max\_word\_len}=32$), per-character case flags, a $(\text{max\_word\_len}-1)$ binary boundary-label vector from the Morfessor teacher, a word id against a $120$K word vocabulary, and a root id against a $30$K root vocabulary (the root being the first Morfessor segment), together with a sentence attention mask. The boundary labels are produced by Morfessor (Creutz and Lagus, [2007](https://arxiv.org/html/2606.18717#bib.bib9)) and then *root-corrected*: for in-dictionary words, intra-root Morfessor boundaries are removed when an independent root lexicon agrees on the root span, reducing root over-segmentation. This correction is applied only to the training labels and is purely positional (it never rewrites strings), so Morpheus remains surface-preserving at inference. For Morpheus training the sentence cache is capped at $900$K (train) / $100$K (validation) sentences, while the word and root vocabularies are built from the full corpus; the separate $1$M-line cap referred to later applies only to the downstream language-model evaluation (Section [4.6](https://arxiv.org/html/2606.18717#S4.SS6)), not to Morpheus itself.
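A small sketch may make the boundary-label vector and the positional root correction concrete. It assumes that label `i` (0-indexed) marks a boundary between characters `i` and `i+1`, and that the lexicon supplies the root length; both helpers (`boundary_labels`, `root_correct`) are hypothetical names, and the way the lexicon "agrees" with Morfessor is not modelled here.

```python
MAX_WORD_LEN = 32  # padding length quoted in the text


def boundary_labels(segments: list[str], max_word_len: int = MAX_WORD_LEN) -> list[int]:
    """Binary vector of length max_word_len - 1; 1 marks a teacher boundary."""
    labels = [0] * (max_word_len - 1)
    pos = 0
    for seg in segments[:-1]:          # no boundary after the last segment
        pos += len(seg)
        if pos - 1 < len(labels):
            labels[pos - 1] = 1        # boundary between chars pos-1 and pos
    return labels


def root_correct(labels: list[int], root_len: int) -> list[int]:
    """Clear boundaries that fall strictly inside the first root_len characters.

    Only label positions change; the word string itself is never rewritten.
    """
    corrected = list(labels)
    for i in range(min(root_len - 1, len(corrected))):
        corrected[i] = 0
    return corrected


# Illustrative over-segmented root: ki|tap|lar with lexicon root "kitap".
raw = boundary_labels(["ki", "tap", "lar"], max_word_len=9)
print(raw)
print(root_correct(raw, root_len=len("kitap")))
```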

### 3.3 Model architecture

Morpheus maps a word, given as a character sequence, to a set of morpheme boundaries and a single word embedding in one forward pass, through three stages connected by a differentiable segmentation operator. All hidden states share a working dimension of $d=320$.

#### Character encoder and positional morphology\.

Each character embedding is concatenated with a learned case-flag embedding, passed through a multi-scale convolution (kernel widths $2$ to $6$) that captures local character $n$-grams, and then through $3$ self-attention layers, producing context-aware character vectors $H=(h_1,\dots,h_L)\in\mathbb{R}^{L\times d}$. A defining property of Turkish is that morpheme identity is governed by *position relative to the root*: suffixes attach in a fixed slot order (number, then possessive, then case), so the same surface syllable plays a different role depending on how many morphemes precede it. In *ev-ler-imiz-de* (“in our houses”), *-ler* is plural in the first post-root slot, *-imiz* first-person-plural possessive in the second, and *-de* locative in the third. The model must therefore reason about *offsets between characters* (how far a candidate boundary is from the previous one) rather than their absolute indices. For this reason both the character encoder and the boundary detector apply Rotary Position Embedding (RoPE) (Su et al., [2021](https://arxiv.org/html/2606.18717#bib.bib17)) on each attention head’s subspace, injecting *relative* offsets directly into the attention dot-product so that a single learned pattern (e.g. “two characters past the previous boundary”) generalizes across roots of different lengths.
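The relative-offset property that motivates RoPE can be checked numerically. The sketch below is a generic pure-Python rotary embedding, not the Morpheus attention code: the frequency `base` of 10000 follows the common RoPE convention and is an assumption here, as are the function names. It shows that the query-key dot product depends only on the difference between the two positions.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)                       # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same offset (2 characters apart) at two different absolute positions.
near_start = dot(rope(q, 3), rope(k, 1))
further_in = dot(rope(q, 10), rope(k, 8))
print(abs(near_start - further_in) < 1e-9)  # the two scores agree up to rounding
```

Because the score is unchanged when both positions shift together, a pattern learned two characters after a short root applies equally two characters after a long one.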

#### Boundary detector\.

A stack of $4$ RoPE self-attention layers over $H$, followed by an adjacent-pair scoring head, emits a boundary probability

$$p_i \;=\; \sigma\!\big(\mathrm{score}(h_i,h_{i+1})\big)\in[0,1] \tag{1}$$

for each inter-character position $i=1,\dots,L-1$. The vector $\mathbf{p}=(p_1,\dots,p_{L-1})$ is the only interface to the rest of the model: everything downstream is a differentiable function of $\mathbf{p}$.

#### Differentiable Poisson–binomial segmentation\.

The central difficulty is turning soft per-position boundary probabilities into discrete morpheme segments *without* a non-differentiable $\arg\max$/threshold that would block gradients from the semantic objectives back to the boundary detector. We resolve it with a Poisson–binomial dynamic program that computes, in closed form, the soft assignment of each character to each segment. Let $b_i\in\{0,1\}$ be the latent boundary indicator at position $i$ with $\Pr[b_i=1]=p_i$, taken independent. Character $j$ belongs to segment $k$ (0-indexed) exactly when $k$ boundaries occur before it, i.e. $\sum_{i<j}b_i=k$. Since the $p_i$ differ, $\sum_{i<j}b_i$ follows a *Poisson–binomial* distribution, whose mass is accumulated by

$$f_j[k] \;=\; f_{j-1}[k]\,(1-p_{j-1}) \;+\; f_{j-1}[k-1]\,p_{j-1}, \tag{2}$$

with base case $f_1[0]=1$ and $f_j[k]=\Pr[\sum_{i<j}b_i=k]$. The resulting matrix $M[j,k]=f_j[k]\in\mathbb{R}^{L\times S}$ (with $S$ the maximum number of segments and $\sum_k M[j,k]=1$) is a *soft segment-membership* matrix: row $j$ is a distribution over which morpheme character $j$ belongs to. Equation ([2](https://arxiv.org/html/2606.18717#S3.E2)) is differentiable in $\mathbf{p}$, costs $O(LS)$, and has three properties exploited by design. (i) Differentiability: gradients from the word-level objectives flow through $M$ into the boundary detector, so boundaries are shaped both by the teacher and by what produces good embeddings. (ii) Soft/hard duality: as $p_i\to\{0,1\}$ each row of $M$ converges to one-hot, recovering exact hard segmentation; the same module yields soft memberships in training and discrete morphemes at inference, switched only by the training flag. (iii) Surface preservation: $M$ only *groups* characters (it never inserts, drops, or rewrites them), so concatenating the segments reproduces the input word, which is why $\mathrm{decode}(\mathrm{encode}(w))=w$ holds by construction.

#### Segment pooling and the word embedding\.

Each segment $k$ is summarized by attention-pooling the character vectors weighted by their membership, $s_k=\sum_j \alpha_{jk}h_j$ with $\alpha_{jk}\propto M[j,k]\exp(a(h_j))$ for a learned scorer $a(\cdot)$, so that within-segment characters compete while cross-segment leakage is suppressed by $M$. The word embedding is the mean of the valid segment vectors followed by a two-layer feed-forward network with residual LayerNorm, $e_w=\mathrm{LayerNorm}(\mathrm{FFN}(\frac{1}{S'}\sum_k s_k))\in\mathbb{R}^{320}$, where $S'$ is the number of valid segments. Because $e_w$ comes from the same forward pass that yields the boundaries, the morpheme structure that defines the tokenization is exactly the structure pooled into the embedding, which is the architectural basis for treating Morpheus as a tokenizer and an embedder at once.
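The pooling step can be sketched with plain lists. This is an illustration under stated assumptions: the scorer outputs $a(h_j)$ are passed in as precomputed scalars, a segment is treated as "valid" when its total membership weight is non-zero, and the FFN and LayerNorm that follow the mean are omitted. The names `pool_segments` and `mean_vector` are hypothetical.

```python
import math


def pool_segments(H: list[list[float]], M: list[list[float]],
                  scores: list[float]) -> list[list[float]]:
    """H: L x d character vectors, M: L x S memberships, scores: L scalars a(h_j).

    Returns one pooled vector s_k per non-empty segment.
    """
    L, d, S = len(H), len(H[0]), len(M[0])
    segments = []
    for k in range(S):
        weights = [M[j][k] * math.exp(scores[j]) for j in range(L)]
        z = sum(weights)
        if z <= 1e-12:                 # no character belongs here: skip (assumption)
            continue
        alpha = [w / z for w in weights]           # normalised within segment k
        segments.append([sum(alpha[j] * H[j][c] for j in range(L)) for c in range(d)])
    return segments


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Mean over the valid segment vectors (input to the FFN in the text)."""
    n = len(vectors)
    return [sum(v[c] for v in vectors) / n for c in range(len(vectors[0]))]


H = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]           # three characters, d = 2
M = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]           # chars 0-1 in segment 0, char 2 in segment 1
segs = pool_segments(H, M, scores=[0.0, 0.0, 0.0])
print(segs)
print(mean_vector(segs))
```

Multiplying by `M[j][k]` before normalising is what keeps a character from contributing to a segment it does not belong to, regardless of how high its score is.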

### 3.4 Training

The total loss is a weighted sum of four terms,

$$\mathcal{L}=w_{\text{aux}}\mathcal{L}_{\text{aux}}+w_{\text{sgns}}\mathcal{L}_{\text{sgns}}+w_{\text{ctr}}\mathcal{L}_{\text{ctr}}+w_{\text{mlm}}\mathcal{L}_{\text{mlm}}. \tag{3}$$

$\mathcal{L}_{\text{aux}}$ is a deep-supervised boundary BCE plus a count regularizer against the (root-corrected) Morfessor labels; its weight follows a curriculum, decaying geometrically from $0.50$ to $0.08$ over $10$ epochs so the teacher anchors early training and then yields to the distributional signals. $\mathcal{L}_{\text{sgns}}$ is skip-gram negative sampling ($16$ negatives, $\pm 6$ window, $120$K context vocabulary); $\mathcal{L}_{\text{ctr}}$ is an InfoNCE contrastive loss on root identity (the Morfessor first segment, temperature $0.10$); and $\mathcal{L}_{\text{mlm}}$ is a vocabulary-free character-level reconstruction in which $20\%$ of words in a sentence are masked and regenerated character-by-character by a small encoder–decoder. We optimize with AdamW, a cosine learning-rate schedule, and gradient clipping, using an effective batch of $512$ (batch $256$ $\times$ gradient accumulation $2$) for $10$ epochs. TF32 matmuls are enabled while loss components are computed in FP32 for numerical stability; AMP/BF16 is left off for reproducibility. Training runs in roughly $30$ minutes per epoch ($\sim$5 hours total) on a single NVIDIA A100 $80$GB. Training dynamics (loss convergence, the per-objective curves, the aux-weight curriculum, and optimization stability) are reported in Section [4.1](https://arxiv.org/html/2606.18717#S4.SS1).
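The auxiliary-weight curriculum can be written as a one-line geometric schedule. The sketch assumes the weight is updated once per epoch and reaches exactly $0.08$ at the last of the $10$ epochs; the text does not say whether the decay is per epoch or per step, so treat `aux_weight` as a hypothetical helper.

```python
def aux_weight(epoch: int, start: float = 0.50, end: float = 0.08, epochs: int = 10) -> float:
    """Geometric decay: constant ratio between consecutive epochs (0-indexed)."""
    ratio = (end / start) ** (1.0 / (epochs - 1))
    return start * ratio ** epoch


schedule = [aux_weight(e) for e in range(10)]
print([round(w, 3) for w in schedule])

# Effective batch size quoted in the text: per-step batch times accumulation steps.
print(256 * 2)
```

A geometric decay removes the same fraction of the teacher's weight each epoch, so the hand-over to the distributional objectives is gradual rather than a sudden switch.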

## 4 Results

### 4.1 Training dynamics

Figure [1](https://arxiv.org/html/2606.18717#S4.F1) shows that the total train and validation loss decrease smoothly and track each other without divergence, while the boundary detector’s precision, recall, and F1 rise quickly and then plateau, confirming that the Morfessor-supervised objective is learned early. The four objectives converge jointly (Figure [2](https://arxiv.org/html/2606.18717#S4.F2)): the auxiliary boundary loss drops fastest as the teacher anchors the early epochs, while the skip-gram, contrastive, and MLM losses continue to shape the embedding geometry afterwards. Figure [3](https://arxiv.org/html/2606.18717#S4.F3) documents the optimization regime behind these curves: the cosine learning-rate schedule, the geometric decay of the auxiliary weight from $0.50$ to $0.08$ that realizes the teacher-to-distributional curriculum, and a gradient norm that stays bounded throughout, which is evidence that running in full precision (AMP off) yields a stable, reproducible trajectory.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/train_vs_val_loss.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/pr_f1_acc_recall_graphs.png)

Figure 1: Training dynamics. Left: total train/validation loss. Right: boundary-detection precision, recall, F1, and accuracy over training.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/train_vs_val_aux_loss.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/train_vs_val_sgns_loss.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/train_vs_val_ctr_loss.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/train_vs_val_mlm_loss.png)

Figure 2: Per-objective train/validation curves: auxiliary boundary loss, skip-gram (SGNS), root-identity contrastive, and character-level MLM.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/learning_rate.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/aux_weight_decay.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/training/train_gradient_norm.png)

Figure 3: Optimization regime. Left: cosine learning-rate schedule. Middle: geometric decay of the auxiliary-loss weight ($0.50 \rightarrow 0.08$), realizing the teacher-to-distributional curriculum. Right: gradient norm, stable throughout under full-precision training.

### 4.2 Experimental setup

All tokenizers are trained on the same corpus to ensure a fair comparison. The baselines are BPE, byte-level BPE, and Unigram (SentencePiece, $64$K), WordPiece ($64$K, HuggingFace), Morfessor, and the rule-based TurkishTokenizer (Bayram et al., [2025a](https://arxiv.org/html/2606.18717#bib.bib3)); Morpheus uses a $50$K vocabulary distilled from its own hard segmentations. For language modeling we train a parameter-equalized $\sim$58M GPT with each tokenizer for an identical $10{,}000$ optimizer steps on the same data and schedule, so that bits-per-character (BPC) reflects the tokenizer rather than model capacity or compute. Intrinsic metrics use a stratified test set (seen / OOV / curated-OOV / nonce) and gold sets: UD_Turkish-Kenet for MorphScore and reversibility ($30$K inflected words) and the SIGMORPHON 2022 Turkish inflection set. Embedding evaluations use frozen word vectors and a common probe across encoders, comparing Morpheus to BERTurk (Schweter, [2020](https://arxiv.org/html/2606.18717#bib.bib15)) and BGE-M3 (Chen et al., [2024](https://arxiv.org/html/2606.18717#bib.bib7)).
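BPC is comparable across tokenizers because it normalizes by characters rather than by tokens. The sketch below uses the standard definition (total negative log-likelihood converted from nats to bits, divided by the character count); the paper's exact evaluation script is not shown in this section, so counting characters as Unicode code points and the helper name `bits_per_character` are assumptions.

```python
import math


def bits_per_character(token_nll_nats: list[float], text: str) -> float:
    """Total model loss over the text, in bits, per character of the original text."""
    total_bits = sum(token_nll_nats) / math.log(2)   # nats -> bits
    return total_bits / len(text)


# Two hypothetical tokenizations of the same 4-character string with the same total loss:
few_tokens = [2 * math.log(2)] * 2      # 2 tokens at 2 bits each
many_tokens = [1 * math.log(2)] * 4     # 4 tokens at 1 bit each
print(bits_per_character(few_tokens, "evde"))
print(bits_per_character(many_tokens, "evde"))
```

Per-token loss would make the second tokenizer look better simply because it emits more tokens; dividing by characters removes that effect.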

### 4.3 Reversibility: the generation gate

Table [1](https://arxiv.org/html/2606.18717#S4.T1) reports $\mathrm{decode}(\mathrm{encode}(w))=w$ over $30{,}204$ inflected wordforms. Morpheus and the subword family are reversible; the two tokenizers that elsewhere appear strongest are not. WordPiece recovers only $58.2\%$ of words because it strips Turkish diacritics, and TurkishTokenizer $95.4\%$ because its canonical re-harmonization rewrites surface forms: for example, it maps *saatlerde* (“at the hours”) to *saat||lar||da*, which decodes to the non-word *saatlarda*. Since a generative model must decode every produced id back to faithful text, only the reversible subset is valid for generation. This is the gate through which the remaining comparisons are read.
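The round-trip test itself is simple to state in code. The sketch below uses two toy tokenizers that only imitate the behaviours described above (a surface-preserving splitter and a diacritic-stripping one); they are hypothetical stand-ins, not Morpheus or WordPiece, and the three-word list is illustrative rather than the $30{,}204$-word evaluation set.

```python
from typing import Callable

# Maps the six Turkish-specific letters to ASCII look-alikes (lossy on purpose).
_STRIP = str.maketrans("çğıöşü", "cgiosu")


def split_encode(word: str) -> list[str]:
    """Toy surface-preserving tokenizer: cut after the second character."""
    return [word[:2], word[2:]]


def stripping_encode(word: str) -> list[str]:
    """Toy lossy tokenizer: discards diacritics before emitting the piece."""
    return [word.translate(_STRIP)]


def join_decode(pieces: list[str]) -> str:
    return "".join(pieces)


def roundtrip_accuracy(words: list[str],
                       encode: Callable[[str], list[str]],
                       decode: Callable[[list[str]], str]) -> float:
    """Fraction of words for which decode(encode(w)) == w."""
    ok = sum(1 for w in words if decode(encode(w)) == w)
    return ok / len(words)


words = ["çiçeğin", "saatlerde", "evlerimizde"]
print(roundtrip_accuracy(words, split_encode, join_decode))
print(roundtrip_accuracy(words, stripping_encode, join_decode))
```

A tokenizer that only groups characters passes regardless of where it cuts, whereas any normalization step fails on every word it actually changes.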

Table 1: Reversibility over $30{,}204$ inflected words. WordPiece strips diacritics; TurkishTokenizer applies lossy canonicalization.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_roundtrip.png)

Figure 4: Roundtrip accuracy per tokenizer. The reversible cluster (Morpheus, BPE/ByteBPE/Unigram) versus the lossy WordPiece and TurkishTokenizer.

### 4.4 Surface fidelity

A tokenizer can place boundaries well yet still corrupt the surface string. We probe this with a curated set of $50$ OOV-leaning Turkish words, scoring each segmentation along four increasingly strict criteria (Table [2](https://arxiv.org/html/2606.18717#S4.T2)): *root%*, whether the first segment is the correct root; *count%*, whether the number of segments matches the gold; *len%*, whether the segment *lengths* match (i.e. the boundaries are placed correctly); and *exact%*, whether the segment *strings* exactly match the surface morphemes.

The decisive comparison is the drop from len% to exact%, which isolates decode corruption from boundary placement. Morpheus identifies the root best of all tokenizers ($66\%$) and, critically, shows *no* drop from len to exact ($38\% \rightarrow 38\%$): every boundary it places is also a faithful surface string, the signature of lossless decoding. TurkishTokenizer presents the opposite pattern: it places boundaries best ($\text{count}=92\%$, $\text{len}=78\%$) but its strings match the surface only $10\%$ of the time, a $68$-point collapse. The mechanism is concrete and systematic: on the loanword-exception forms *saatlerde*, *rollerde*, *harflerle*, TurkishTokenizer returns *saat||lar||da*, *rol||lar||da*, *harf||lar||la*. The boundaries are correct, but the surface suffixes *-ler*/*-de* are rewritten to their canonical vowel-harmonic forms *-lar*/*-da*, so the decoded strings (*saatlarda*, …) are no longer the input words. Morpheus returns *saatler||de*, *rol||lerde*, which are surface-exact and hence reversible. The subword tokenizers are low and roughly flat across len and exact (they neither normalize nor align), confirming that the len $\rightarrow$ exact gap is a clean diagnostic for the lossy canonicalization unique to the rule-based system. Table [3](https://arxiv.org/html/2606.18717#S4.T3) traces this through concrete decode outcomes: notably, even when Morpheus places a boundary incorrectly (*çi||çe||ğin*), its decode still reconstructs the input, because the segmentation only groups characters, whereas TurkishTokenizer and WordPiece, with cleaner-looking or whole-word outputs, decode to non-words.
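The four criteria can be expressed as a per-word check plus an average. The sketch below is an illustration: the gold segmentation *saat|ler|de* is supplied here for demonstration rather than taken from the paper's curated set, and `fidelity_flags` / `fidelity_rates` are hypothetical names.

```python
def fidelity_flags(pred: list[str], gold: list[str]) -> dict[str, bool]:
    """Evaluate one predicted segmentation against the gold surface morphemes."""
    return {
        "root": pred[0] == gold[0],                                # first segment is the root
        "count": len(pred) == len(gold),                           # same number of segments
        "len": [len(s) for s in pred] == [len(s) for s in gold],   # same boundary positions
        "exact": pred == gold,                                     # same surface strings
    }


def fidelity_rates(pairs: list[tuple[list[str], list[str]]]) -> dict[str, float]:
    """Percentage of words satisfying each criterion."""
    totals = {"root": 0, "count": 0, "len": 0, "exact": 0}
    for pred, gold in pairs:
        for name, ok in fidelity_flags(pred, gold).items():
            totals[name] += ok
    return {name: 100.0 * n / len(pairs) for name, n in totals.items()}


gold = ["saat", "ler", "de"]
canonicalized = ["saat", "lar", "da"]    # boundaries right, suffix strings rewritten
print(fidelity_flags(canonicalized, gold))
print(fidelity_rates([(canonicalized, gold), (gold, gold)]))
```

The canonicalized output passes `len` but fails `exact`, which is the len-to-exact gap the text uses as its diagnostic; a tokenizer that never rewrites characters cannot produce that pattern.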

Table 2: Qualitative surface fidelity on 50 curated OOV-leaning words. *root%*: first segment is the correct root; *count%*: segment count matches gold; *len%*: boundaries placed correctly; *exact%*: segment strings match the surface morphemes. The len% → exact% drop isolates decode corruption: zero for Morpheus, 68 points for TurkishTokenizer (e.g. *saatlerde* → *saat||lar||da*). †Not reversible.

Table 3: Representative decode outcomes. Morpheus is surface-preserving: even where its boundaries are imperfect (*çi||çe||ğin*), the concatenation still reproduces the input. TurkishTokenizer rewrites surface allomorphs to canonical forms (*-üm*, *-lar*/*-da*, *-ün*) and WordPiece strips diacritics (ç, ğ, ı), so both decode to non-words. BPE is reversible but morphology-blind (no split).

### 4.5 Morphological alignment

On gold morphological segmentation, Morpheus and the rule-based TurkishTokenizer far outrank the subword family, with Morpheus the strongest *reversible* option (Table [4](https://arxiv.org/html/2606.18717#S4.T4)). On MorphScore (UD_Turkish-Kenet), Morpheus reaches a macro-F1 of 0.61, roughly double the subword family (∼0.32) and close to TurkishTokenizer (0.65), but with zero length-mismatch, whereas TurkishTokenizer’s score carries the canonical-normalization caveat shown above. On SIGMORPHON inflection, Morpheus has the best lemma-prefix rate after Morfessor (0.76), and the Kalbur root-correction of its teacher lifts root-in-segments from 0.35 (Morfessor) to 0.48.
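To make "alignment with gold segmentation" concrete, the sketch below computes a generic boundary-level F1 between a predicted and a gold segmentation. It is an illustration only: the exact MorphScore definition and its macro-averaging are not reproduced here, and the helper names and the example word are assumptions.

```python
def boundaries(segments):
    """Character offsets of the internal boundaries, e.g. ['ev', 'ler'] -> {2}."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_f1(pred, gold):
    """Harmonic mean of boundary precision and recall for one word."""
    p, g = boundaries(pred), boundaries(gold)
    if not p and not g:
        return 1.0  # both leave the word unsplit
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision = tp / len(p)
    recall = tp / len(g)
    return 2 * precision * recall / (precision + recall)


# A prediction that merges two adjacent suffixes: one of two gold boundaries found.
print(boundary_f1(["kitap", "larımız"], ["kitap", "lar", "ımız"]))
```

Because the score is computed on offsets rather than strings, a tokenizer that canonicalizes suffixes without changing their length would still score well here, which is why the surface-fidelity check is reported separately.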

Table 4: Morphological alignment: MorphScore macro-F1 (UD_Turkish-Kenet) and SIGMORPHON lemma-prefix and root-in-segments rates. Morpheus is the strongest reversible option. †Not reversible.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_morphscore.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_sigmorphon.png)

Figure 5: Morphological alignment. Left: MorphScore (UD_Turkish-Kenet) macro-F1. Right: SIGMORPHON inflection rates (lemma-prefix and root-in-segments).

### 4.6 Language modeling and efficiency

To compare tokenizers under equal compute, each ∼58M-parameter GPT is trained for an identical 10,000 optimizer steps on a 1M-line cap of the corpus with the same schedule; Figure [6](https://arxiv.org/html/2606.18717#S4.F6) shows the resulting training-loss and validation-BPC curves. The curves are well-behaved and stratify clearly: among reversible tokenizers, Morpheus reaches the lowest validation BPC (1.425 vs. 1.436 for BPE, 1.449 for ByteBPE, 1.437 for Unigram, 1.446 for Morfessor). WordPiece’s nominally lower 1.384 is an artifact of modeling diacritic-stripped, lower-entropy text, and TurkishTokenizer’s 1.442 comes with lossy decoding; both are excluded from the valid comparison (Table [5](https://arxiv.org/html/2606.18717#S4.T5)). On TR-MMLU, Morpheus attains the highest frequency-weighted purity (%Pure, 83.5%) and Turkish-token rate (%TR, 91.8%) of all tokenizers, indicating that the tokens it actually emits in running text align with Turkish morphemes. Its fertility (1.73 tokens/word) sits between the subword family (∼1.5) and the rule-based tokenizers (∼1.9–2.0): the deliberate cost of morpheme-level tokenization. At generation, Morpheus uses ∼19% less peak GPU memory than the 64K-vocab subword tokenizers (3,020 vs. 3,723 MB at batch 32), while its higher fertility lowers raw character throughput (Figure [7](https://arxiv.org/html/2606.18717#S4.F7)).
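Bits-per-character is the quantity that makes tokenizers with different fertility comparable: the summed token-level negative log-likelihood is divided by the number of characters rather than the number of tokens. A minimal sketch of that conversion, together with the memory arithmetic quoted above, follows; the function names are illustrative and the paper's own evaluation code may differ.

```python
import math


def bits_per_char(total_nll_nats, n_chars):
    """Convert a summed negative log-likelihood (in nats) into bits per character."""
    return total_nll_nats / (math.log(2) * n_chars)


def relative_saving(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Peak GPU memory at batch 32, as reported: 3,020 MB vs. 3,723 MB.
print(f"{relative_saving(3723, 3020):.1%}")  # about 19%
```

Normalizing by characters means a tokenizer cannot lower its score merely by emitting fewer, longer tokens; it can, however, lower it by modeling a simplified text, which is the caveat attached to the diacritic-stripped WordPiece figure.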

#### Tokenizer throughput vs\. generation throughput\.

It is important to separate the tokenizer’s *own* speed from end-to-end generation, as the two tell different stories (Figure [10](https://arxiv.org/html/2606.18717#S4.F10)). Morpheus’s pure-PyTorch encoder runs at ∼4.0M chars/s, faster than BPE/ByteBPE (∼1.0M) and WordPiece (2.2M) but behind Unigram (4.8M), and its decoder reaches ∼0.69M words/s, nearly 2× the subword family (∼0.35–0.38M). TurkishTokenizer is fastest on both (6.1M chars/s, 0.92M words/s), but this partly reflects its Rust backend rather than a lower algorithmic cost; Morpheus is a research-grade PyTorch implementation and is still competitive. The takeaway is that the ∼1.6× end-to-end generation gap (Figure [7](https://arxiv.org/html/2606.18717#S4.F7)) is driven by Morpheus’s higher *fertility* (more autoregressive forward passes per character), not by slow tokenization: the tokenizer itself is fast, and its decode is among the quickest measured.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_lm_training.png)

Figure 6: Downstream language-model training. Left: training loss versus optimizer step for the param-equalized 58M GPT under each tokenizer. Right: validation BPC. Among reversible tokenizers Morpheus reaches the lowest BPC.

| Tokenizer | BPC | Fert. (tok/w) | %Pure$_{\text{fw}}$ | GPU (MB) |
|---|---|---|---|---|
| Morpheus | 1.425 | 1.73 | 83.5 | 3020 |
| BPE | 1.436 | 1.51 | 48.8 | 3723 |
| ByteBPE | 1.449 | 1.53 | 49.1 | 3723 |
| Unigram | 1.437 | 1.52 | 50.0 | 3723 |
| Morfessor | 1.446 | 1.91 | 77.8 | 1977 |
| WordPiece† | 1.384 | 1.39 | 40.1 | 3723 |
| TurkishTok.† | 1.442 | 1.98 | 78.2 | 2152 |

Table 5: Language modeling and efficiency. BPC at equal 10K steps; frequency-weighted %Pure on TR-MMLU; peak GPU memory at batch 32. †Not reversible, so excluded from the valid BPC comparison.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_pareto_bpc_gen.png)

Figure 7: BPC versus generation throughput. Among reversible tokenizers Morpheus is on the quality frontier, trading throughput (higher fertility) for the lowest BPC and morphological structure.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_bpc.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_gpu_memory.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_gen_throughput.png)

Figure 8: Language-modeling efficiency. Left: BPC at equal 10K steps. Middle: peak GPU memory during generation. Right: end-to-end generation throughput.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_trmmlu.png)

Figure 9: TR-MMLU tokenization quality: Turkish-token (%TR) and pure-token (%Pure) rates. Morpheus leads on the frequency-weighted measures.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_encode_speed.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/figures/fig_decode_speed.png)

Figure 10: Tokenizer throughput, separate from end-to-end generation. Left: encoding speed (chars/s). Right: decoding speed (words/s). Morpheus’s decode is ∼2× the subword family; TurkishTokenizer leads on both, partly via its Rust backend.

### 4.7 Morpheus as a word embedder

Because Morpheus is neural, the same forward pass that tokenizes also emits a 320-dim word embedding. We evaluate it frozen against BERTurk and BGE-M3 (Table [6](https://arxiv.org/html/2606.18717#S4.T6), Figure [12](https://arxiv.org/html/2606.18717#S4.F12)). The picture splits sharply by task character, and the split is a direct consequence of how the embedding is trained.

#### Where Morpheus wins: lexical / root\-level tasks\.

On retrieving other forms of the same root and on verifying whether two words share a root, Morpheus leads decisively: root-family retrieval MAP 0.85 (vs. 0.80 for BGE-M3, 0.49 for BERTurk) and same-root verification ROC-AUC 1.00 (vs. 0.98, 0.70), despite the *smallest* embedding (320 vs. 768/1024 dims). This is by design: the root-identity contrastive objective explicitly pulls all inflections of a root toward a common point, so the geometry is organized around roots. The t-SNE projections (Figure [11](https://arxiv.org/html/2606.18717#S4.F11)) make this visible: Morpheus produces the tightest, most clearly separated root-family clusters of the three encoders.
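Root-family retrieval can be pictured as follows: each word queries all other words by cosine similarity, and a hit is any word sharing the query's root. The sketch below computes mean average precision under that reading. The 2-D vectors and root labels are toy values invented for illustration, not Morpheus outputs, and the paper's exact protocol (candidate pool, tie handling) is not specified here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_precision(query, vectors, roots):
    """AP for one query word; relevant items are the other words with the same root."""
    others = [w for w in vectors if w != query]
    ranked = sorted(others, key=lambda w: cosine(vectors[query], vectors[w]), reverse=True)
    hits, precisions = 0, []
    for rank, w in enumerate(ranked, start=1):
        if roots[w] == roots[query]:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(vectors, roots):
    return sum(average_precision(q, vectors, roots) for q in vectors) / len(vectors)


# Hypothetical embeddings in which each root family occupies its own direction.
vectors = {
    "kitap": [1.0, 0.0], "kitaplar": [0.9, 0.1], "kitabımız": [0.8, 0.2],
    "ev": [0.0, 1.0], "evler": [0.1, 0.9],
}
roots = {"kitap": "kitap", "kitaplar": "kitap", "kitabımız": "kitap", "ev": "ev", "evler": "ev"}
print(mean_average_precision(vectors, roots))
```

When every family forms a tight cluster, as in this toy geometry, all same-root words rank ahead of the rest and the score reaches its maximum; overlap between families pushes relevant items down the ranking and lowers it.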

#### Where Morpheus loses: context\- and inflection\-dependent tasks\.

On morphological probing of number (0.59 vs. 0.95 for BERTurk) and case (0.22 vs. 0.89) and on WikiANN NER (macro-F1 0.48 vs. 0.79), the heavier contextual encoders win. This too follows from the architecture, on two counts. First, the very objective that sharpens root geometry *collapses* the inflectional contrasts a probe must read: by pulling *kitap*, *kitaplar*, *kitabımız* together, it deliberately discards the number/case signal that distinguishes them. Second, the embedding is a *static*, per-word vector with no sentence context, whereas NER is inherently contextual, and BERTurk/BGE-M3 are contextual encoders with 2–3× the dimensionality. Morpheus is therefore not a drop-in replacement for a contextual encoder; it is a complementary, cheap, morphology-aware *lexical* encoder. In a multi-vector retrieval (RAG) system this is precisely the right division of labor: Morpheus serves the lexical/keyword index (root matching, dedup, stemming), while a contextual model serves the dense semantic index.
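The first point can be illustrated with a generic contrastive loss. The sketch below uses an InfoNCE-style form with same-root words as positives; this form, the temperature, and the toy vectors are assumptions chosen for illustration, since the exact objective used to train Morpheus is not given in this section.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def root_contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: small when same-root vectors sit closer to the anchor
    than vectors of other roots (assumed form, not the paper's definition)."""
    a = normalize(anchor)
    pos = [math.exp(dot(a, normalize(p)) / temperature) for p in positives]
    neg = [math.exp(dot(a, normalize(q)) / temperature) for q in negatives]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)


kitap = [1.0, 0.0]                 # hypothetical anchor
ev = [0.0, 1.0]                    # hypothetical word from another root
kitaplar_far = [0.5, 0.85]         # plural kept apart from the singular
kitaplar_near = [0.99, 0.1]        # plural pulled onto the singular

print(root_contrastive_loss(kitap, [kitaplar_far], [ev]))
print(root_contrastive_loss(kitap, [kitaplar_near], [ev]))
```

The loss falls as the plural form moves onto the singular, so minimizing it removes exactly the direction a number probe would need to separate the two.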

Table 6: Frozen word-embedding evaluation. Morpheus leads on lexical / root-level tasks; contextual encoders lead on inflection- and context-dependent tasks.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/tsne_morpheus.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/tsne_berturk.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/tsne_bge-m3.png)

Figure 11: t-SNE of word embeddings colored by root family, for Morpheus (left), BERTurk (middle), and BGE-M3 (right). Morpheus organizes the space by root identity, producing the tightest root-family clusters.

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/neighbors_map.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/dedup_auc.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/probing_accuracy.png)

![Refer to caption](https://arxiv.org/html/2606.18717v1/results/paper_eval/embeddings/figures/ner_f1.png)

Figure 12: Embedding evaluation across encoders. Morpheus leads on lexical retrieval (MAP) and same-root verification (ROC-AUC); the heavier contextual encoders lead on morphological probing and NER.

## 5 Discussion

#### One signal, two roles\.

The results support the paper’s central claim: a single neural morpheme\-boundary model can serve as both a lossless tokenizer and a word embedder\. The coupling is not incidental—the differentiable Poisson–binomial segmentation lets the same morphological signal that places boundaries also shape the pooled embedding, so quality on one role reinforces the other rather than competing for capacity\.

#### Lossless\-versus\-lossy is the decisive axis\.

The two tokenizers that appear to dominate on isolated metrics—WordPiece on raw BPC, TurkishTokenizer on gold morphology—are both disqualified for generation by reversibility\. Reading every metric through the generation gate reverses the apparent ranking: among tokenizers whose ids decode to faithful Turkish, Morpheus offers the lowest BPC, the highest frequency\-weighted token purity, the strongest morphological alignment, and lower memory, simultaneously\. We argue this axis, largely absent from prior Turkish tokenization comparisons, should be reported whenever a tokenizer is proposed for generative use\.
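The effect of the generation gate on the ranking can be checked directly against the values reported in Table 5. The sketch below is only a re-reading of those published numbers; the tuple layout and the function name `best_bpc` are illustrative choices.

```python
# (tokenizer, BPC, fertility tok/word, %Pure_fw, peak GPU MB, reversible), from Table 5
TABLE5 = [
    ("Morpheus",         1.425, 1.73, 83.5, 3020, True),
    ("BPE",              1.436, 1.51, 48.8, 3723, True),
    ("ByteBPE",          1.449, 1.53, 49.1, 3723, True),
    ("Unigram",          1.437, 1.52, 50.0, 3723, True),
    ("Morfessor",        1.446, 1.91, 77.8, 1977, True),
    ("WordPiece",        1.384, 1.39, 40.1, 3723, False),
    ("TurkishTokenizer", 1.442, 1.98, 78.2, 2152, False),
]


def best_bpc(rows, reversible_only):
    """Name of the tokenizer with the lowest BPC, optionally after the reversibility gate."""
    pool = [r for r in rows if r[5] or not reversible_only]
    return min(pool, key=lambda r: r[1])[0]


print(best_bpc(TABLE5, reversible_only=False))  # WordPiece
print(best_bpc(TABLE5, reversible_only=True))   # Morpheus
```

Applying the gate before ranking, rather than reporting it as a footnote, is what changes the winner on BPC.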

#### A root\-centric embedding, by design\.

The embedding results are a genuine finding, not a shortfall to hide. Morpheus wins lexical retrieval and dedup but underperforms on number/case probing and NER, and the cause is mechanistic: the contrastive objective on root identity deliberately pulls all inflections of a root together, which sharpens root-level geometry while collapsing the inflectional contrasts a linear probe would read, and the pooled static vector lacks the sentence context NER needs. This makes Morpheus complementary to, not a replacement for, contextual encoders. In a multi-vector retrieval system its embeddings are a natural fit for the *lexical* index (cheap, morphology-aware, and strong at root matching), while a contextual model such as BGE-M3 or BERTurk serves the dense semantic index.

#### What you trade\.

Morpheus brings modeling quality, morphological structure, embeddings, lossless reversibility, and lower GPU memory than the 64K-vocab subword tokenizers together, a combination no other Turkish tokenizer offers. The cost is higher fertility (∼1.73 vs. ∼1.5 tokens/word) and, because unseen words are segmented by the neural model rather than a lookup table, a heavier tokenizer artifact and lower raw character throughput. For latency-bound generation a subword tokenizer remains preferable; for Turkish systems that value faithful decoding, morphology, or embeddings, Morpheus is the better-informed default.

## 6 Limitations and Trade-offs

We frame the constraints of Morpheus as trade\-offs rather than flat deficiencies: each cost is the flip side of a concrete gain, and points to the workloads where Morpheus is—or is not—the right choice\.

#### Fertility for quality and faithfulness\.

Morpheus emits more tokens per word (∼1.73 vs. ∼1.5 for subwords), which lengthens sequences and lowers raw generation throughput (∼1.6× slower than BPE). This is a token-count effect rather than slow tokenization, since its own encode/decode are competitive (Section [4.6](https://arxiv.org/html/2606.18717#S4.SS6)). In return it delivers the lowest BPC among reversible tokenizers (1.425), morpheme-aligned tokens, lossless decoding, and ∼19% lower GPU memory. The exchange favors quality- and morphology-sensitive systems; for latency-bound raw generation a subword tokenizer remains preferable.

#### A neural artifact for OOV generalization\.

Because unseen words are segmented by the model rather than a lookup table, the deployable tokenizer carries a PyTorch checkpoint instead of a few\-megabyte vocabulary\. That same property is what lets Morpheus segment*any*Turkish word—including nonce and rare agglutinative forms—without a vocabulary cap, which a fixed BPE/WordPiece table cannot do\.

#### A root\-centric embedding: strength and limit are the same design\.

The embedding leads on lexical retrieval (MAP 0.85) and same-root verification (ROC-AUC 1.00) precisely because the contrastive objective concentrates a root’s inflections; that same concentration is why it trails contextual encoders on number/case probing and NER. The embedding is also static and lower-dimensional (320 vs. 768/1024). Morpheus is therefore complementary to, not a replacement for, contextual encoders: it is the right representation for the lexical component of a system (retrieval, dedup, stemming, keyword matching) and the wrong one for tasks that hinge on sentence context or fine inflectional features.

#### Scope\.

The model and its supervision are Turkish\-specific by design, and the gold sets emphasize inflectional morphology \(SIGMORPHON, UD\_Turkish\-Kenet\), so derivational families and long, rare agglutinative chains—where the boundary detector occasionally merges adjacent suffixes—are comparatively under\-probed\.

## 7 Conclusion

Turkish agglutination breaks the assumptions of the tokenizers that drive modern language models. Frequency-driven subword methods fragment meaning-bearing suffixes and inflate token counts, while the tokenizers that align best with morphology (WordPiece and the rule-based TurkishTokenizer) do so by rewriting the surface string and cannot decode their output back to faithful text (only 58.2% and 95.4% roundtrip). Word representation, meanwhile, is handled by separate, heavyweight models decoupled from tokenization. This is the gap the paper addresses.

#### Novelty and mechanism\.

We introduced **Morpheus**, a neural morpheme-boundary model that is at once a lossless, morphology-aware tokenizer and a word embedder. The novelty is a single mechanism, a differentiable Poisson–binomial segmentation, that (i) lets word-level objectives train the boundary detector end-to-end, (ii) recovers exact hard segmentation at inference with no architectural switch, and (iii) only *groups* characters, so $\mathrm{decode}(\mathrm{encode}(w)) = w$ holds by construction and the same forward pass yields a structured embedding.
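The mechanism can be sketched in a few lines. If each gap between adjacent characters carries an independent boundary probability, then the morpheme index of a character is the number of boundaries to its left, which follows a Poisson-binomial distribution and can be computed by a simple dynamic program. The code below is a minimal reading of that idea in plain Python, not the released implementation: the function names, the 0.5 inference threshold, and the example word are assumptions.

```python
def soft_memberships(boundary_probs):
    """boundary_probs[i] is P(boundary between character i and character i+1).
    Returns M with M[t][k] = P(character t belongs to morpheme k)."""
    n = len(boundary_probs) + 1
    M = [[0.0] * n for _ in range(n)]
    M[0][0] = 1.0  # the first character always opens morpheme 0
    for t in range(1, n):
        p = boundary_probs[t - 1]
        for k in range(t + 1):
            stay = M[t - 1][k] * (1.0 - p)                # no boundary: same morpheme
            move = M[t - 1][k - 1] * p if k > 0 else 0.0  # boundary: next morpheme
            M[t][k] = stay + move
    return M


def hard_segments(word, boundary_probs, threshold=0.5):
    """Inference: cut wherever the boundary probability exceeds the threshold."""
    segments, start = [], 0
    for i, p in enumerate(boundary_probs, start=1):
        if p > threshold:
            segments.append(word[start:i])
            start = i
    segments.append(word[start:])
    return segments


word = "evlerde"
probs = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # hypothetical boundaries after "ev" and "ler"
segments = hard_segments(word, probs)
print(segments, "".join(segments) == word)
```

With probabilities strictly between 0 and 1 the memberships are soft and the recurrence is differentiable in the probabilities; with probabilities of exactly 0 or 1 each row of `M` becomes one-hot and matches the hard segmentation. Since the segments are slices of the input, concatenating them always returns the original word.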

#### Measured success\.

Restricted to tokenizers whose ids decode to faithful Turkish (the set valid for generation), Morpheus simultaneously attains the lowest BPC (1.425), the highest frequency-weighted token purity on TR-MMLU (83.5%), the strongest morphological alignment (MorphScore macro-F1 0.61, ∼2× the subword family), 100% reversibility, and ∼19% lower GPU memory than the 64K-vocab subword tokenizers. As an embedder it leads on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), ahead of BGE-M3 (0.80/0.98) and BERTurk (0.49/0.70). These survive the reversibility gate that disqualifies the apparent leaders, so they are real gains rather than metric artifacts.

#### Trade\-offs and where to use it\.

The costs are concrete: higher fertility (1.73 vs. ∼1.5 tokens/word, ∼1.6× slower generation), a neural artifact instead of a lookup table, and a root-centric embedding that trails contextual encoders on NER and number/case probing. This yields a clear usage recipe. Morpheus is the better-informed default for Turkish *NLU and sequence-labeling* (classification, morphological segmentation/analysis), for the *lexical / keyword index of a multi-vector RAG* system (root matching, dedup, stemming), for pretraining small-to-medium Turkish LMs where faithful decoding and morphology matter, and for *memory-constrained* inference. It should be paired with, not substituted for, a contextual encoder such as BERTurk or BGE-M3 on context-dependent tasks, and a subword tokenizer remains preferable for latency-bound raw generation. In expanding the Turkish tokenization design space with a lossless, morphology-aware, embedding-producing option, Morpheus gives the many Turkish systems that have so far had to choose among lossy or morphology-blind alternatives a single model that is none of those things.

## References

- Ahmet Afşın Akın and Mehmet Dündar Akın. 2007. Zemberek, an open source NLP framework for Turkic languages. *Structure*.
- Duygu Altinok. 2026. Optimal Turkish subword strategies at scale: Systematic evaluation of data–vocabulary–morphology interplay. *arXiv preprint arXiv:2602.06942*.
- M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, Savaş Yıldırım, and Demircan Çelik. 2025a. Tokens with meaning: A hybrid tokenization approach for Turkish. *arXiv preprint arXiv:2508.14292*.
- M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, and Savaş Yıldırım. 2025b. Tokenization standards for linguistic integrity: Turkish as a benchmark. *arXiv preprint arXiv:2502.07057*.
- M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri, and Savaş Yıldırım. 2025c. Tokenization standards and evaluation in natural language processing: A comparative analysis of large language models on Turkish. In *2025 33rd Signal Processing and Communications Applications Conference (SIU)*. IEEE.
- M. Ali Bayram, Banu Diri, and Savaş Yıldırım. 2026. Adapting multilingual embedding models to Turkish via cross-lingual tokenizer surgery and offline distillation. *arXiv preprint arXiv:2605.29992*.
- Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. *arXiv preprint arXiv:2402.03216*.
- Mathias Creutz and Krista Lagus. 2002. Unsupervised discovery of morphemes. In *Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning (SIGPHON)*, pages 21–30.
- Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. *ACM Transactions on Speech and Language Processing*, 4(1):1–34.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In *Proceedings of NAACL*, pages 4171–4186.
- Senol Gulgonul. 2025. HeceTokenizer: A syllable-based tokenization approach for Turkish retrieval. Preprint.
- Yiğit Bekir Kaya and A. Cüneyd Tantuğ. 2024. Effect of tokenization granularity for Turkish large language models. *Intelligent Systems with Applications*, 21:200335.
- Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In *Proceedings of ACL*, pages 66–75.
- Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In *Proceedings of EMNLP: System Demonstrations*, pages 66–71.
- Stefan Schweter. 2020. BERTurk – BERT models for Turkish. Zenodo.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In *Proceedings of ACL*, pages 1715–1725.
- Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. RoFormer: Enhanced transformer with rotary position embedding. *arXiv preprint arXiv:2104.09864*.
- Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahınuç, and Oguzhan Ozcelik. 2023. Impact of tokenization on language models: An analysis for Turkish. *ACM Transactions on Asian and Low-Resource Language Information Processing*, 22(4):1–21.