PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

Summary

Introduces PACUTE, a diagnostic benchmark of 4,600 tasks evaluating morphological understanding in Filipino, revealing that even frontier models struggle with morpheme decomposition and productive morphological composition.

arXiv:2606.15144v1 (announce type: new)

# Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino
Source: [https://arxiv.org/html/2606.15144](https://arxiv.org/html/2606.15144)
Jann Railey Montalan \(1,2\), David Demitri Africa \(3\), Jimson Paulo Layacan, Richell Isaiah Flores \(4\), Ivan Yuri De Leon \(4\), Lance Calvin Gamboa \(5\)

\(1\) AI Singapore, \(2\) Nanyang Technological University, \(3\) UK AI Security Institute, \(4\) Ateneo de Manila University, \(5\) University of Birmingham

###### Abstract

Large language models \(LLMs\) process text as sequences of subword tokens, which can obscure the character\-level and morphological structure that underlies word formation\. This limitation is most acute for languages with non\-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries\. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic\-driven lexical distinctions that are typically absent from written text\. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down\. Evaluating open\-weight LLMs and frontier commercial models, we find that open\-weight models perform near chance on morpheme decomposition regardless of scale\. Frontier models perform much better, often recovering individual affixes under contains\-match scoring, but remain far below their character\-level ceilings on compositional tasks of morpheme transformations and syllabification\. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word\-structure understanding\.

Corresponding author: Jann Railey Montalan \(railey@aisingapore\.org\)\. Equal contribution is marked for Jann Railey Montalan and David Demitri Africa\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.15144v1/x1.png)

Figure 1: Overview of PACUTE\. \(A\) Filipino morphology poses challenges for standard tokenizers: infixes split roots, reduplication copies syllables, and stress diacritics are typically omitted in text\. \(B\) PACUTE comprises four task categories targeting different levels of word structure understanding\. \(C\) A six\-level hierarchical diagnostic localizes failures: models show above\-chance performance on character\-level tasks \(L0–L1\) but collapse to chance on morpheme decomposition \(L2\), with downstream levels inheriting this failure regardless of scale\.

Large language models segment text into subword tokens and process them as atomic units, which limits their direct access to character\-level structure\. Benchmarks such as CUTE \(edman\-etal\-2024\-cute\) and LangGame \(sims2025stochastokimprovingfinegrainedsubword\) demonstrate that many LLMs struggle with basic token\-level tasks such as counting characters, character manipulations, and substring detection\.

#### Filipino poses distinct challenges for tokenizers\.

We claim that Filipino, the national language of the Philippines, is a natural \(and naturalistic\) testbed to evaluate such capabilities in LLMs, for two reasons\.

#### Productive non\-concatenative morphology\.

Filipino uses infixes \(e\.g\., \-um\-, \-in\-\) that split root words and partial reduplication of initial syllables\. For example, the root kain "eat" becomes kumain "ate" via the actor\-focus infix \-um\-, inserted after the first consonant\. Standard byte\-pair encoding tokenizers \(sennrich\-etal\-2016\-neural\) are suboptimal for language model pre\-training \(bostrom2020byte\), and frequently segment such words at arbitrary boundaries that ignore morpheme structure\.
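The infixation rule just described can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's generator: the function name `insert_infix` is hypothetical, and the rule "insert before the first vowel" is a simplifying assumption that matches the examples in the text for roots with a single initial consonant or an initial vowel.

```python
VOWELS = set("aeiou")


def insert_infix(root: str, infix: str) -> str:
    """Insert an infix after the initial consonant onset of a root.

    Simplifying assumption: the onset ends at the first vowel, so the
    infix is placed immediately before that vowel.
    """
    for i, ch in enumerate(root.lower()):
        if ch in VOWELS:
            return root[:i] + infix + root[i:]
    raise ValueError(f"no vowel found in root: {root!r}")


if __name__ == "__main__":
    for root, infix in [("kain", "um"), ("takbo", "um"), ("sulat", "in")]:
        print(root, "+", f"-{infix}-", "->", insert_infix(root, infix))
```

The point of the sketch is that the inserted material lands inside the root, so a tokenizer that has memorized `kain` as one unit has no guarantee of seeing the same unit inside `kumain`.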

#### Omitted stress and glottal markers\.

Filipino orthography can encode lexical distinctions through diacritics: acute \(pahilís\), grave \(paiwà\), and circumflex \(pakupyâ\)\. These create minimal pairs \(warstadt2020blimp\): bása "read" vs\. basâ "wet"; súka "vomit" vs\. sukà "vinegar"; táyo "we" vs\. tayô "stand"\. However, everyday Filipino text typically omits these markers, creating systematic lexical ambiguity invisible to models\.
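To make the ambiguity concrete, the sketch below strips diacritics from the minimal pairs listed above and shows that each pair collapses to a single written form. It uses only Python's standard `unicodedata` module; the helper name `strip_diacritics` is an illustrative assumption, not part of PACUTE.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Minimal pairs taken from the text above.
PAIRS = [("bása", "basâ"), ("súka", "sukà"), ("táyo", "tayô")]

if __name__ == "__main__":
    for first, second in PAIRS:
        print(first, second, "->", strip_diacritics(first), strip_diacritics(second))
```

Once the marks are removed, nothing in the character sequence distinguishes the two members of a pair, so any disambiguation has to come from context.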

Based on this, we introduce PACUTE \(Phonology\-, Affix\-, and Character\-level Understanding of Tokens Evaluation\), a benchmark of 4,600 synthetic tasks targeting Filipino composition, manipulation, morphological extraction, morphological production, and syllabification, each available in multiple\-choice \(log\-probability\) and generative formats\. PACUTE also includes a hierarchical diagnostic set \(six levels\) that separates prerequisite character skills from morpheme\-level operations, enabling localization of bottlenecks\.

We evaluate several pre\-trained and instruction\-tuned LLMs across major open families and sizes, and run continued pre\-training on Gemma\-2\-2B with three preprocessing/tokenization regimes: vanilla BPE, stochastic token expansion \(StochasTok; sims2025stochastokimprovingfinegrainedsubword\), and a Filipino morphology\-aware expand/contract method based on StochasTok\. Our results show: \(i\) open\-weight models collapse to chance on morpheme decomposition under MCQ log\-probability scoring, despite above\-chance performance on lower\-level character tasks; \(ii\) frontier commercial models often recover individual affixes, but remain substantially below their character\-level ceilings on hierarchical morpheme transformations and syllabification; and \(iii\) Filipino\-relevant pre\-training improves mid\-range models, while tokenization interventions alone produce limited gains and can induce catastrophic forgetting\.

## 2 Related Work

#### Relevant properties and evaluation of Filipino\.

Filipino has been characterized in the linguistic literature as exhibiting several non\-concatenative morphological processes, including infixation \(e\.g\., \-um\-, \-in\-\), partial and full reduplication, and distinctions in stress and glottal stop that may be encoded through diacritics \(schachter1983tagalog; yap1967synchronic\)\. These phenomena interact closely with phonological structure and are not always transparently reflected in surface orthography, particularly in naturally occurring text where diacritics are frequently omitted\. As a result, many linguistically meaningful contrasts in Filipino are obscured at the character and token levels typically used in computational models\.

Despite these properties, computational research on Filipino has largely focused on downstream NLP tasks such as sentiment analysis, part\-of\-speech tagging, and named entity recognition \(cruz2022improving; visperas2023itanong\)\. A recent benchmark, BATAYAN \(montalan2025batayan\), broadens evaluation coverage across understanding and generation tasks, yet remains primarily task\-oriented rather than diagnostic of Filipino\-specific linguistic structure\.

#### Token and character level evaluations\.

In the broader literature, several studies have introduced token\- and character\-level diagnostic evaluations for other languages\. These include tests based on spelling, orthographic similarity, and character\-level perturbations \(itzhak2022models; huang2023inducing\), as well as synthetic tasks involving letter counting, prefix or suffix identification, and substring detection \(kaushal2022tokens; efrat2023lmentry\)\. CUTE extends this line of work by systematically evaluating models' token composition knowledge through character\-level manipulations and token\-based reasoning tasks \(edman\-etal\-2024\-cute\)\. While effective as a general diagnostic framework, CUTE is designed to be language\-agnostic and does not explicitly target non\-concatenative morphology or phonology\-driven alternations\.

Morphology benchmarks such as SIGMORPHON evaluate inflection across diverse languages \(vylomova2020sigmorphon\), but emphasize concatenative processes and transparent orthography, assumptions that do not hold for Filipino\. Subword tokenization variants such as BPE\-dropout \(provilkov\-etal\-2020\-bpe\) improve robustness to surface variation, but evaluation has focused on downstream performance rather than linguistically grounded morphological structure\. PACUTE fills this gap with Filipino\-specific diagnostics targeting phonology, affixation, and character\-level structure\.

## 3 PACUTE: Task Suite

#### Goal\.

PACUTE is a diagnostic benchmark for Filipino word\-structure competence: whether a model can \(i\) access character\-level information inside tokens and \(ii\) use that information to reason about productive morphology \(especially infixation and reduplication\)\. PACUTE is not primarily designed to measure general Filipino "understanding" via downstream tasks; rather, it targets concrete, local operations that underlie morphological generalization\.

#### Why tokenization is a plausible failure source\.

Standard subword tokenizers \(e\.g\., BPE\) optimize compression and frequency statistics rather than morpheme boundary preservation\. For Filipino, infixes and reduplicated segments often fall inside token interiors or are split inconsistently across tokens, making them hard to access as reusable units during pre\-training and inference\. Diacritics encoding lexical stress distinctions \(bása "read" vs\. basâ "wet"\) are nearly always absent from digital text \(almario2014masinop\), so models face a systematic mismatch between lexicographic and naturally occurring forms\. Together, infixation, reduplication, morphophonemic alternations, and diacritic omission create a setting where surface\-form competence is not sufficient for morphological competence \(see Appendix [A](https://arxiv.org/html/2606.15144#A1) for a detailed treatment\)\. This motivates two PACUTE design choices: \(i\) tasks that explicitly require operations aligned with morphological structure rather than only generic character probes, and \(ii\) a hierarchical diagnostic that separates prerequisite character skills from morpheme\-level skills, allowing localization of where the compositional pipeline breaks down\.

#### Task formats\.

Each PACUTE task is provided in two evaluation formats\. MCQ instances present four answer options \(chance = 25%\)\. GEN instances require the model to produce an answer string\. We include both formats because MCQ reduces generation and formatting confounds, while GEN tests whether the model can reliably execute the operation\. Scoring procedures for both formats are described in §[5](https://arxiv.org/html/2606.15144#S5)\.

### 3\.1 Main suite and coverage

The main PACUTE suite contains 4,600 tasks across five categories, each targeting a different slice of subtoken competence: \(i\) Composition \(950 MCQ \+ 550 GEN\), \(ii\) Manipulation \(800 MCQ \+ 800 GEN\), \(iii\) Morphological extraction \(400 MCQ \+ 400 GEN\), \(iv\) Morphological production \(150 MCQ \+ 150 GEN\), and \(v\) Syllabification \(200 MCQ \+ 200 GEN\)\. The composition and manipulation tasks provide character\-level controls; the morphological extraction, morphological production, and syllabification tasks target Filipino\-specific structure\.
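As a quick consistency check, the per-category counts quoted above can be summed to confirm the stated total of 4,600. The dictionary layout and variable names are illustrative only; the numbers are the ones given in the text, and the separate hierarchical diagnostic set is not part of this total.

```python
# Per-category task counts for the main suite, as stated in the text.
TASKS = {
    "Composition": {"MCQ": 950, "GEN": 550},
    "Manipulation": {"MCQ": 800, "GEN": 800},
    "Morphological extraction": {"MCQ": 400, "GEN": 400},
    "Morphological production": {"MCQ": 150, "GEN": 150},
    "Syllabification": {"MCQ": 200, "GEN": 200},
}

mcq_total = sum(counts["MCQ"] for counts in TASKS.values())
gen_total = sum(counts["GEN"] for counts in TASKS.values())
main_total = mcq_total + gen_total

if __name__ == "__main__":
    print("MCQ:", mcq_total, "GEN:", gen_total, "total:", main_total)
```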

#### Composition tasks\.

Composition tasks test whether a model can use orthographic information that is typically hidden by tokenization: character counting, identification of specific characters/diacritics, and simple string properties\. These are Filipino\-adapted variants of standard character\-level probes, using Filipino lexicon and \(where relevant\) diacritic\-bearing forms\. The intent is to separate "can the model see inside tokens?" from morphology\-specific reasoning\.

#### Manipulation tasks\.

Manipulation tasks test deterministic character operations such as insertion, deletion, substitution, and swapping\. These tasks are intentionally low\-level: they test whether models can execute controlled edits that mirror the mechanics needed for infix insertion and reduplication \(even if the correct position for the edit is linguistically determined in the affixation tasks\)\. This category also functions as a sanity check: if a model cannot reliably delete or insert characters, failures on morphology may be uninterpretable\.
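The four edit types named above are simple string operations. The sketch below shows one plausible definition of each; the function names, argument conventions, and examples are assumptions for illustration, since the exact task templates are not reproduced here.

```python
def insert_at(word: str, index: int, text: str) -> str:
    """Insert `text` so that it starts at position `index`."""
    return word[:index] + text + word[index:]


def delete_at(word: str, index: int, length: int = 1) -> str:
    """Delete `length` characters starting at position `index`."""
    return word[:index] + word[index + length:]


def substitute(word: str, old: str, new: str) -> str:
    """Replace every occurrence of `old` with `new`."""
    return word.replace(old, new)


def swap_chars(word: str, i: int, j: int) -> str:
    """Swap the characters at positions `i` and `j`."""
    chars = list(word)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


if __name__ == "__main__":
    print(insert_at("kain", 1, "um"))    # same edit mechanics as infixation
    print(delete_at("kumain", 1, 2))     # inverse edit
    print(substitute("bata", "a", "o"))
    print(swap_chars("takbo", 0, 4))
```

The first two calls show why this category works as a prerequisite check: inserting `um` at index 1 is mechanically the same edit as infixation, except that here the position is given rather than linguistically determined.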

#### Morphological extraction and production tasks\.

Morphological tasks test whether models can identify and apply Filipino affixes, including prefixes, suffixes, and crucially, infixes\. Extraction asks which affix, reduplicant, or root can be identified from an inflected form \(e\.g\., word kumain → root kain \+ completed aspect infix \-um\-\)\. Production asks for the correct derived form given a root and affix \(e\.g\., apply completed aspect affix \-in\- to sulat → sinulat\)\.
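Extraction is the inverse of production: given a surface form, recover the root and the infix. A minimal sketch follows, assuming only the two infixes mentioned in the text and an optional toy lexicon; `extract_infix` and its parameters are hypothetical names. The lexicon check matters because some roots merely look infixed, for example kumot "blanket".

```python
from typing import Optional

VOWELS = set("aeiou")
INFIXES = ("um", "in")  # only the two infixes named in the text


def extract_infix(word: str, lexicon: Optional[set] = None) -> Optional[tuple]:
    """Return (root, infix) if `word` looks like an infixed form, else None.

    The infix is expected right after the initial consonant onset. If a
    lexicon is given, a candidate is accepted only when the recovered
    root is a known word.
    """
    onset_end = next((i for i, ch in enumerate(word) if ch in VOWELS), None)
    if onset_end is None:
        return None
    for infix in INFIXES:
        if word.startswith(infix, onset_end):
            root = word[:onset_end] + word[onset_end + len(infix):]
            if lexicon is None or root in lexicon:
                return root, f"-{infix}-"
    return None


if __name__ == "__main__":
    toy_lexicon = {"kain", "sulat", "takbo", "kumot"}
    for form in ["kumain", "sinulat", "tumakbo", "kumot"]:
        print(form, "->", extract_infix(form, toy_lexicon))
```

Without the lexicon argument the function would split kumot into a non-existent root, which is one way to see why decomposition needs lexical knowledge and not only character access.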

#### Syllabification tasks\.

Syllabification tasks test the phonological scaffolding needed for Filipino morphology\. We treat syllable\-level competence as an intermediate level between raw character manipulation and morpheme\-level reasoning\. This category includes stress identification tasks, which ask the model to identify the stressed syllable of a given word as it appears in a disambiguating sentence context\. Because Filipino orthography conventionally omits stress diacritics, stress assignment is non\-trivial: many words are orthographically identical but phonologically and semantically distinct depending on which syllable carries primary stress \(e\.g\., sála "sin" vs\. salà "filter" vs\. salâ "broken"\)\. Models must therefore use both lexical knowledge and sentential context to resolve stress\.
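A rough sketch of what syllable-level processing involves is given below. It is a simplified rule set, not the benchmark's gold procedure (the gold boundaries come from the dictionary resource): every vowel is a nucleus, `ng` is always treated as one consonant, a single consonant between vowels opens the next syllable, and in a cluster the first consonant closes the current syllable. These rules are assumptions that fit simple native (C)V(C) words and will fail on loanwords or on words where `n` and `g` belong to different sounds.

```python
import unicodedata

VOWELS = set("aeiouáàâéèêíìîóòôúùû")


def graphemes(word: str) -> list:
    """Split a word into letters, keeping the digraph 'ng' as one unit."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2].lower() == "ng":
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def syllabify(word: str) -> list:
    """Split a word into syllables using the simplified rules above."""
    units = graphemes(unicodedata.normalize("NFC", word))
    nuclei = [i for i, u in enumerate(units) if u.lower() in VOWELS]
    if not nuclei:
        return [word]
    syllables, start = [], 0
    for cur, nxt in zip(nuclei, nuclei[1:]):
        gap = nxt - cur - 1  # number of consonant units between two nuclei
        end = cur + 1 if gap <= 1 else cur + 2
        syllables.append("".join(units[start:end]))
        start = end
    syllables.append("".join(units[start:]))
    return syllables


if __name__ == "__main__":
    for w in ["kumain", "takbo", "galing"]:
        print(w, "->", "-".join(syllabify(w)))
```

Stress identification is not covered by this sketch; it requires lexical and contextual knowledge that no purely orthographic rule can supply.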

#### Hierarchical diagnostic benchmark\.

In addition to the main suite, PACUTE includes a hierarchical diagnostic benchmark \(600 MCQ \+ 600 GEN\) organized into six levels: Level 0 \(character recognition\), Level 1 \(character manipulation\), Level 2 \(morpheme decomposition\), Level 3 \(morpheme manipulation\), Level 4 \(morpheme composition\), Level 5 \(complex multi\-step transformations\)\. The hierarchy is designed to localize failures: Levels 3–5 presuppose that the model can decompose a word into morphemes \(Level 2\)\. Interpretation of per\-level results is discussed in §[5](https://arxiv.org/html/2606.15144#S5)\.

#### Prompting and instance structure\.

All instances use short, uniform templates with minimal instruction overhead\. GEN instances use a structured XML format with a <reflection\> reasoning trace followed by an <answer\> block; the evaluator extracts only the answer before scoring\. Full prompting details, chat\-template injection, and thinking\-mode handling are described in Appendix [B](https://arxiv.org/html/2606.15144#A2)\.

## 4 Data Construction

All PACUTE tasks are generated deterministically from public lexical resources and hand\-curated annotations, without any use of generative models for content production\. Code and benchmark data can be found at [https://github\.com/raileymontalan/pacute\-bench](https://github.com/raileymontalan/pacute-bench)\. This design choice is motivated by two considerations: \(i\) avoiding contamination of the benchmark with patterns already present in LLM pre\-training corpora, and \(ii\) ensuring that every instance has a verifiable gold answer derived from linguistic rules rather than model\-generated text \(bender\-etal\-2021\-parrots\)\.

Table 1: PACUTE task statistics by category and format\. LangGame = language\-agnostic control; MDA \(multi\-digit addition\) = non\-linguistic control; CUTE = character\-understanding control \(generative only\)\.

#### Lexical resources\.

The primary source is a syllabified word list from the UP Diksiyonaryong Filipino \(updiksiyonaryo2001; 16,828 entries with syllable boundaries, stress, and POS\), paired with a Filipino corpus frequency list \(118,801 forms\) for frequency\-weighted sampling\. Morphological tasks draw from a manually curated dataset of inflection \(e\.g\., takbo \+ \-um\- → tumakbo\), assimilation \(e\.g\., naNG\- \+ bigay → namigay\), and reduplication patterns \(zamar2022filipino; schachter1983tagalog\)\. Syllabification tasks use a handcrafted corpus of minimally contrastive word pairs in disambiguating sentence context \(e\.g\., búhay "life" vs\. buháy "alive"\)\. All task instances are generated by a deterministic Python pipeline; items are available in English and Filipino prompt formats and summarized in Table [1](https://arxiv.org/html/2606.15144#S4.T1)\.
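One way to combine frequency-weighted sampling with determinism is to draw from a seeded random generator. The sketch below is an assumption about how such a step could look, not the released pipeline: the function name, the seed handling, and the toy frequency counts are all invented for illustration, and sampling is done with replacement.

```python
import random


def sample_words(freqs: dict, k: int, seed: int = 0) -> list:
    """Draw k words with probability proportional to corpus frequency.

    A fixed seed and a sorted word order make the draw reproducible.
    """
    rng = random.Random(seed)
    words = sorted(freqs)
    weights = [freqs[w] for w in words]
    return rng.choices(words, weights=weights, k=k)


if __name__ == "__main__":
    # Hypothetical frequency counts, for illustration only.
    toy_freqs = {"kain": 120, "sulat": 80, "takbo": 40, "bigay": 10}
    print(sample_words(toy_freqs, k=5, seed=42))
```

Re-running with the same seed reproduces the same sample, which is what makes gold answers and item sets stable across releases.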

## 5 Evaluation

#### Overview\.

We use PACUTE to evaluate \(i\) a broad set of LLMs \(zero\-shot\) and \(ii\) a controlled continued\-pre\-training \(CPT\) study where we vary tokenization during Filipino\-domain adaptation\. We report normalized accuracy and F1 score on MCQ tasks \(4\-way; chance = 25%\) and string\-match accuracy on generative tasks\.

#### Models\.

We evaluate open\-weight models spanning several major families \(Gemma, GPT\-2, GPT\-OSS, Llama, Mistral, Phi, Qwen, SEA\-LION, DeepSeek, Kimi\), parameter scales from 124M to 1T, and base pre\-trained, instruction\-tuned, and thinking variants\. We additionally evaluate frontier commercial models \(Claude 4 series, GPT\-5 series, and Gemini 3 series\) on generative benchmarks only, because most commercial APIs do not expose the token log\-probabilities needed for MCQ scoring\. This mix allows us to separate effects of scale, instruction tuning, regional pre\-training data \(SEA\-LION\), and frontier capability\.

#### Scoring\.

All models are evaluated on the full PACUTE suite, hierarchical benchmark, and controls \(CUTE, LangGame, MDA\)\. For MCQ, we select the option with highest conditional log\-probability and report normalized accuracy and F1 \(4\-way; chance = 25%\)\. For GEN, we use exact match after lightweight normalization \(case, whitespace, leading/trailing hyphens on affix strings\) and contains\-match for tasks where models may produce extra context\. Per\-level hierarchical results \(L0–L5\) are interpreted diagnostically: if L2 \(morpheme decomposition\) is at chance, L3–L5 are expected to collapse regardless of L0–L1 performance\.
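The scoring rules above can be sketched as follows. Several details are assumptions and are marked as such in the comments: the chance-corrected formula for normalized accuracy is inferred from the fact that reported values can be negative, the regular expression for the answer block is a guess at the format, and all function names are hypothetical.

```python
import re

CHANCE = 0.25  # four answer options


def pick_option(option_logprobs: list) -> int:
    """MCQ: index of the option with the highest conditional log-probability."""
    return max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])


def normalized_accuracy(raw_accuracy: float, chance: float = CHANCE) -> float:
    """ASSUMED chance-corrected accuracy: 0 at chance, 100 at perfect."""
    return 100.0 * (raw_accuracy - chance) / (1.0 - chance)


def extract_answer(response: str) -> str:
    """Keep only the <answer> block; fall back to the full response (assumption)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return match.group(1) if match else response


def normalize(text: str) -> str:
    """Lightweight normalization: case, whitespace, leading/trailing hyphens."""
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text.strip("-")


def exact_match(response: str, gold: str) -> bool:
    return normalize(extract_answer(response)) == normalize(gold)


def contains_match(response: str, gold: str) -> bool:
    return normalize(gold) in normalize(extract_answer(response))


if __name__ == "__main__":
    reply = "<reflection>k + um + ain</reflection><answer>The infix is -um-.</answer>"
    print(exact_match(reply, "-um-"), contains_match(reply, "-um-"))
```

Contains-match is deliberately lenient, so for very short gold strings such as a two-letter affix it can credit a response that mentions the string by accident; this is worth keeping in mind when comparing EM and CM figures.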

#### Continued pre\-training \(CPT\) setup\.

For CPT, we adapt Gemma\-2\-2B on the SEA\-PILE v2 Filipino corpus \(7\.4GB\) for up to 5,000 steps, saving checkpoints every 1,000 steps\. We compare three segmentation regimes applied at preprocessing time while keeping the underlying model architecture fixed: \(i\) vanilla \(unchanged BPE tokenization\), \(ii\) StochasTok \(stochastic token expansion; approximately 10% split rate\), and \(iii\) Patok \(morphology\-aware expand/contract using Filipino affix rules\)\. We evaluate each checkpoint on the same benchmark suite\.
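The idea of stochastic token expansion can be illustrated with a toy, string-level sketch: each multi-character token is split into two pieces with some probability, so the model sees the same text under varying segmentations. This is not the StochasTok implementation and it ignores any vocabulary constraint on the resulting pieces; the function name, the seed, and the example tokens are assumptions.

```python
import random


def stochastic_expand(tokens: list, split_rate: float = 0.1, seed: int = 0) -> list:
    """Randomly split tokens into two non-empty pieces.

    Each token with more than one character is split at a random
    position with probability `split_rate`. The concatenated text is
    unchanged; only the segmentation varies.
    """
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if len(tok) > 1 and rng.random() < split_rate:
            cut = rng.randint(1, len(tok) - 1)
            out.extend([tok[:cut], tok[cut:]])
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    toks = ["kum", "ain", " ", "si", "nulat"]  # hypothetical segmentation
    print(stochastic_expand(toks, split_rate=0.5, seed=1))
```

A morphology-aware variant would differ in where it allows cuts, preferring affix boundaries over random positions, but the details of Patok are not specified here and are not guessed at.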

#### Human baseline\.

We establish a human baseline on a PACUTE subset using three native Filipino speakers, reporting average accuracy and Fleiss' κ for inter\-annotator agreement \(Appendix [D](https://arxiv.org/html/2606.15144#A4)\)\.
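For readers unfamiliar with the agreement statistic, the standard Fleiss' κ computation is sketched below. The input format (one row per item, one count per category) and the toy ratings are assumptions for illustration; they are not the study's annotation data.

```python
def fleiss_kappa(ratings: list) -> float:
    """Fleiss' kappa for a table where ratings[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Toy table: 4 items, 3 raters, 2 categories; full agreement on each item.
    print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```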

## 6 Results

\[Figure 2\. Panels: Composition, Manipulation, M\. Extraction, M\. Production, Syllabification, Hierarchical\. Axis: Contains Match \(%\), 0 to 100\. Models: Gemma\-4\-E2B, Phi\-4, Gemma\-3\-27B, Qwen\-3\-14B^T, GPT\-OSS\-120B^T, GPT\-5\.4\-Mini, GPT\-5\-Mini, GPT\-5\.5, Sonnet\-4\.6, Gemini\-3\.5\-Flash^T, DeepSeek\-R1^T, Kimi\-K2^T\.\]

Figure 2: Contains\-match accuracy across PACUTE categories for 12 models spanning open\-weight \(gray, 5B–120B\) and frontier commercial \(color\) tiers\. Superscript T = extended thinking/reasoning\. Composition and Manipulation approach ceiling for strong models, while Hierarchical and Syllabification remain suppressed across all classes\. Even GPT\-5\.5 \(77\.5% Hierarchical\-CM\) sits 22pp below its Manipulation ceiling\. DeepSeek\-R1 leads Hierarchical \(72\.7% CM\); Gemini\-3\.5\-Flash dominates Morphological Extraction and Production \(90\.5/90\.0% CM\) but flops on Hierarchical \(62\.2%\), suggesting morpheme operations and multi\-step composition recruit partially disjoint capabilities\.

Full numerical results are reported in Tables [5](https://arxiv.org/html/2606.15144#A4.T5), [6](https://arxiv.org/html/2606.15144#A4.T6), [7](https://arxiv.org/html/2606.15144#A4.T7), [8](https://arxiv.org/html/2606.15144#A4.T8), [9](https://arxiv.org/html/2606.15144#A4.T9), [10](https://arxiv.org/html/2606.15144#A4.T10), and [11](https://arxiv.org/html/2606.15144#A5.T11) in the Appendix\. Figure [2](https://arxiv.org/html/2606.15144#S6.F2) summarizes the main pattern: strong models approach ceiling on character\-level and language\-agnostic controls, but remain substantially weaker on Filipino\-specific phonological and morphological tasks\. The gap is clearest on the Hierarchical benchmark and on Syllabification, which require models to use morpheme structure, syllable structure, stress, and phonologically conditioned transformations rather than simple character edits\.

#### Morpheme decomposition is difficult for open\-weight models\.

Among open\-weight models evaluated with MCQ log\-probability scoring, performance on the Hierarchical benchmark remains close to chance\. Normalized accuracy ranges from −3\.6 to \+2\.9 across 32 open\-weight variants \(mean: −0\.6\)\. This collapse is driven by Level 2 morpheme decomposition: models often fail to identify the root and affix structure of Filipino words, even when they perform above chance on lower\-level character recognition and manipulation tasks\. Scale alone does not remove this failure\. Larger open\-weight models, including Gemma\-4\-31B and Qwen\-3\.6\-27B, do not reliably outperform much smaller models on morpheme decomposition\.

#### Character\-level controls are largely solved by frontier models\.

Commercial models perform near ceiling on the easiest controls: MDA \(99\.4–100\.0% CM\), LangGame \(94\.9–100\.0%\), and Manipulation \(GPT\-5\.5: 100\.0%, DeepSeek\-R1: 96\.9% CM\)\. These results confirm that PACUTE's harder tasks reflect genuine morphological difficulty, not generic instruction\-following failure\.

#### Morphological extraction is recoverable for strong models, but remains formatting\-sensitive\.

Performance on Morphological Extraction is consistently higher under contains\-match than exact match \(e\.g\., Gemini\-3\.5\-Flash: 89\.5% EM / 90\.5% CM; Claude\-Sonnet\-4\.6: 74\.0% / 76\.2%\), indicating models often recover the correct affix but not always in the required form\.

#### Hierarchical transformations remain the central bottleneck\.

The Hierarchical benchmark remains substantially below the character\-level ceiling\. The best model, GPT\-5\.5, reaches 64\.7% EM and 77\.5% CM on Hierarchical\-GEN, compared with 100\.0% CM on Manipulation\-GEN\. DeepSeek\-R1 reaches 72\.7% CM, Kimi\-K2\-Thinking reaches 67\.2%, Claude Sonnet 4\.6 reaches 68\.0%, and GPT\-5\-Mini reaches 67\.7%\. This pattern suggests that the difficulty is not identifying individual characters or even recovering individual affixes, but composing several operations: decomposing a word into morphemes, manipulating those morphemes, and recombining them into the correct surface form\. PACUTE therefore isolates a gap between local subtoken access and productive morphological reasoning\.

#### Syllabification exposes a separate phonological weakness\.

Syllabification is difficult even for frontier models\. The best performers are GPT\-5\.5 \(83\.0% CM\) and Gemini\-3\.5\-Flash \(82\.5%\), while several strong models score much lower \(GPT\-5\-Mini: 46\.0%; Gemini\-3\.1\-Flash\-Lite: 38\.0%\)\. Even models that handle character manipulation well may still lack stable representations of Filipino syllable structure, stress, and the ng digraph\.

#### Reasoning models narrow but do not close the gap\.

Thinking models are often strong, but their gains are uneven\. DeepSeek\-R1 obtains one of the highest Hierarchical scores among commercial models \(72\.7% CM\), and Kimi\-K2\-Thinking also performs competitively \(67\.2% CM\)\. Gemini\-3\.5\-Flash is especially strong on Morphological Extraction, Morphological Production, and Syllabification, reaching 90\.5%, 90\.0%, and 82\.5% CM respectively, but remains lower on Hierarchical at 62\.2% CM\. These results suggest that reasoning can help when the relevant linguistic knowledge is available, but it does not by itself guarantee robust morpheme decomposition or composition\.

#### Pre\-training data matters below the frontier\.

Among open\-weight models, SEA\-LION models trained on Southeast Asian data outperform generic models of comparable scale on Filipino\-specific tasks\. SEA\-LION\-Qwen\-v4\.5\-27B achieves 85\.8 normalized accuracy on MProd\-MCQ and 27\.5 on Manip\-MCQ, compared with 83\.1 and 26\.2 for Qwen\-3\.6\-27B\. This suggests that Filipino\-relevant pre\-training data improves mid\-range model performance, even though frontier commercial models can partially compensate through general capability and stronger instruction following\.

#### Human baseline\.

Human performance is substantially above model performance on all categories\. Humans particularly outperform the strongest models on MExt\-MCQ \(0\.925 vs\. 0\.895 EM\), MProd\-GEN \(0\.911 vs\. 0\.900 EM\), and Syll\-GEN \(0\.967 vs\. 0\.830 EM\)\. Overall agreement among annotators for PACUTE is high \(90\.4% for GEN, 85\.0% for MCQ\)\. Moreover, Fleiss' κ of 85\.5% for MCQ and 74\.0% for GEN indicates almost perfect and substantial agreement, respectively\.

#### Summary\.

PACUTE reveals a hierarchy of competence: strong models solve character edits and language\-agnostic controls, and often recover individual affixes, but remain well below their character\-level ceilings on Hierarchical and Syllabification tasks\. Scale, instruction tuning, and reasoning all help, but productive morphological composition \(such as decomposing words, tracking syllable structure, and applying phonologically conditioned transformations\) remains the persistent bottleneck\.

## 7 Error Analysis

We conduct a manual error analysis on sampled GEN responses from instruction\-tuned models, focusing on GEN because reasoning traces and surface\-form choices are visible there\. We identify four recurring error types\.

#### Type I: Instruction following failure\.

Many GEN errors reflect non\-compliance rather than a linguistic gap: generating a full sentence instead of a word, explaining instead of answering, or responding in English to a Filipino prompt\. Thinking\-mode models are especially prone, as the correct answer often appears mid\-chain but is not surfaced as the final response\. These errors inflate raw GEN error rates and motivate contains\-match scoring\.

#### Type II: Linguistic error\.

Models attempt the correct operation but apply rules incorrectly\. The most frequent subtype is infix placement error: the infix is identified \(e\.g\., \-um\-\) but inserted at the wrong position \(e\.g\., \*sulumat instead of sumulat\)\. Other subtypes include root boundary confusion, ng\-digraph mishandling in syllabification, and reduplication copying at the wrong edge\.

#### Type III: Reasoning inconsistency\.

Thinking models \(e\.g\., Qwen\-3 variants\) sometimes reach the correct intermediate conclusion in the scratchpad yet output a contradicting final answer, such as by correctly counting syllables then rationalizing a different number, or identifying the right root/affix then selecting a wrong MCQ option\. This suggests the final answer is not robustly conditioned on the chain\-of\-thought\.

#### Type IV: Systematic model bias\.

The most distinctive error class is a penultimate\-stress bias: models default to labeling the second\-to\-last syllable as stressed regardless of actual diacritic position\. Analogous biases appear in affix selection \(preference for high\-frequency mag\- even in illicit contexts\)\. Neither scale nor extended reasoning corrects these biases\. Models also fail to use sentential context for stress disambiguation: for example, Gemma\-3\-27B\-IT answers ga for galing in all contexts, even when the sentence selects the noun reading requiring final stress\. In addition, models treat the digraph ng as two Unicode characters, causing systematic syllable overcounting\.

## 8 Discussion

#### Morpheme decomposition as a training data problem\.

The near\-random performance on Level 2 decomposition tasks among open\-weight models points to a missing\-knowledge failure rather than an insufficient\-capacity failure\. Frontier commercial models partially overcome this floor, but their residual errors concentrate specifically on phonologically conditioned processes \(nasal assimilation, vowel\-initial root alternations\) that require productive rule application rather than memorization\. This suggests two stacked bottlenecks: \(i\) Filipino morphological structure is underrepresented in pre\-training corpora of English\-dominant models, and \(ii\) even when models acquire partial affix knowledge at scale, they fail to learn the phonological rules governing affix–root interactions\. BPE tokenization compounds both by fragmenting infixed and reduplicated forms at boundaries that obscure morpheme structure\. Composition tasks probing diacritic identification score notably lower than non\-diacritic character counting for most models, consistent with stress and glottal markers being nearly absent from everyday Filipino text \(almario2014masinop\)\. This gap has downstream consequences for stress disambiguation and lexical sense selection that are invisible to task\-oriented benchmarks\.

#### Why tokenization\-based interventions have limited impact\.

The CPT results with StochasTok and Patok \(Appendix [E](https://arxiv.org/html/2606.15144#A5)\) show that all three regimes, including vanilla BPE, underperform the base model without CPT on the PACUTE main suite\. Crucially, Hierarchical accuracy remains pinned at the chance floor for all three regimes \(21\.2–24\.7%\), confirming that L2 morpheme decomposition is unmoved by either stochastic token expansion or morphology\-aware segmentation\. This is consistent with prior work showing that BPE\-dropout and similar methods improve robustness to surface variation but do not inject morphological abstraction when the underlying training signal is absent \(bostrom2020byte; sims2025stochastokimprovingfinegrainedsubword\)\. None of these CPT setups yield a net gain for Filipino morphology, and simply exposing the model to more Filipino text under alternative segmentation schemes is insufficient without mechanisms that preserve general capabilities or explicitly align token boundaries with morpheme boundaries at scale\.

#### Implications for low\-resource morphologically complex languages\.

PACUTE's diagnostic hierarchy provides a template for identifying where model competence breaks down for languages with non\-concatenative morphology\. The L0/L1 scores show that models can access character\-level information inside tokens for Filipino to a limited degree; the abrupt collapse at L2 precisely locates the gap\. This has practical implications: downstream Filipino NLP tasks that rely on aspect or voice distinctions \(which are morphologically encoded\) should not be expected to benefit from scale alone if morpheme decomposition is absent\. The SEA\-LION advantage suggests that domain\-specific pre\-training on high\-quality Filipino text is the most actionable intervention currently available\.

## Limitations

PACUTE evaluates word\-level morphological and character\-level competence in a controlled, synthetic setting\. Several limitations should be noted\.

#### Lexical resource coverage\.

All tasks are generated from a curated lexical resource and a manually annotated affix set\. The word pool \(16,828 entries from the UP Diksiyonaryong Filipino\) does not cover informal registers, code\-switching, or borrowings from Spanish and English, which are pervasive in natural Filipino text\. Infix instances in particular are underrepresented in the affixation task set, limiting the diagnostic power of infix\-specific categories in the hierarchical benchmark\.

#### Transfer to naturalistic tasks\.

Because PACUTE is a controlled, synthetic, word\-level benchmark, model performance on it may not transfer directly to naturalistic Filipino NLP tasks\. Tasks that involve diacritic\-bearing forms assume that models have been exposed to or can infer stress distinctions\. However, diacritics are nearly always absent from digital Filipino text, meaning that performance on diacritic tasks may underestimate competence that a model could demonstrate if given diacritic\-marked input\.

#### Reproducibility of continued pretraining\.

The Gemma\-based experiments were conducted on an internal compute cluster using the NVIDIA NeMo training framework for continued pretraining; we do not release the continued pretraining code to avoid disclosing proprietary infrastructure details, though all pretraining runs used NeMo without custom modifications to the training loop or data pipeline\.

## Ethical Considerations

#### Data sources and licensing\.

All lexical data are drawn from published reference materials and corpora: the syllabified word list from the UP Diksiyonaryong Filipino \(updiksiyonaryo2001\), morphological patterns from zamar2022filipino and schachter1983tagalog, and the SEA\-PILE v2 Filipino text corpus \(ng2025sealionsoutheastasianlanguages\)\. No personally identifiable information, social media data, or sensitive personal communications were used in task construction\. SEA\-PILE v2 is released under the ODC\-By 1\.0 license; we use it in accordance with its terms\. The UP Diksiyonaryong Filipino is a copyrighted reference work published by the University of the Philippines\.

#### Intended use\.

PACUTE is intended as a diagnostic tool for identifying specific morphological competence gaps in LLMs, not as a general\-purpose measure of Filipino language ability\. PACUTE should not be used to rank models for production deployment in Filipino\-language applications without considering downstream task performance\.

#### Sociolinguistic scope\.

Filipino is the national language of the Philippines\. We acknowledge that the computational treatment of Filipino morphology in this work is primarily from a formal linguistic perspective and may not capture the full range of dialectal variation or the sociolinguistic complexity of language use in the Philippines\. We encourage follow\-up work that extends PACUTE\-style diagnostics to other Philippine languages \(e\.g\., Cebuano, Ilocano\) and to more naturalistic evaluation settings\.

#### Use of AI assistants\.

AI language models \(Claude\) were used for proofreading and editing of manuscript text\. No AI\-generated content was used for research design, data construction, or analysis\.

## References

## Appendix A Filipino Morphological Properties

We summarize the Filipino properties that motivate PACUTE's task design\.

#### Affixation and non\-concatenative processes\.

Filipino \(Tagalog\-based\) word formation is highly productive and relies on a large inventory of affixes, including prefixes \(e\.g\., mag\-, nag\-, pag\-\), suffixes \(e\.g\., \-an, \-in\), circumfixes \(e\.g\., pag\-…\-an\), and infixes \(notably \-um\- and \-in\-\) \(schachter1983tagalog\)\. Infixes are inserted inside the root rather than concatenated at an edge, typically but not always after the onset of the first syllable: kain → kumain, sulat → sumulat/sinulat\. This insertion position is sensitive to the phonological shape of the root \(e\.g\., vowel\-initial roots tend to surface with an infix realized prefixally\), which complicates purely string\-based decomposition\.
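
As an illustration of the default placement described above, the following minimal Python sketch inserts an infix before the first vowel of the root, which also makes it surface as a prefix on vowel\-initial roots\. The function name infix and the five\-vowel inventory are assumptions made for this example and are not part of PACUTE; the sketch deliberately ignores the exceptions that the text notes\.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for plain (undiacritized) spelling


def infix(root: str, affix: str = "um") -> str:
    """Insert `affix` before the first vowel of `root`.

    For consonant-initial roots this places the affix after the onset
    (kain -> k + um + ain). For vowel-initial roots the first vowel is at
    index 0, so the affix ends up at the left edge, i.e. prefixally.
    """
    for i, ch in enumerate(root.lower()):
        if ch in VOWELS:
            return root[:i] + affix + root[i:]
    return affix + root  # no vowel found: fall back to prefixing


# Examples taken from the text (kain, sulat) and from Table 13 (takbo).
for root, affix in [("kain", "um"), ("sulat", "um"), ("sulat", "in"), ("takbo", "in")]:
    print(f"{root} + -{affix}- -> {infix(root, affix)}")
```

Under this rule takbo with \-in\- comes out as tinakbo, the target form in the Type II example of Table 13, whereas the erroneous takinbo corresponds to inserting the infix after the first vowel instead of before it\.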

#### Reduplication as a productive generator of novel forms\.

Filipino uses reduplication to mark grammatical aspect and related meanings \(zamar2022filipino\)\. Partial reduplication commonly copies the initial CV\(C\) of the root \(luto → luluto, takbo → tatakbo\), while full reduplication copies the entire root \(araw\-araw\)\. Because these patterns are productive, they generate many surface forms that may be rare or unseen as wholes, even when the root is frequent\.
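
The two reduplication patterns can be sketched in a few lines of Python\. This is a simplified illustration, not the generator used to build PACUTE: the function names are hypothetical, only the CV case of partial reduplication is handled \(no CVC copies\), and roots are assumed to begin with at most a simple onset\.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for plain spelling


def partial_reduplicate(root: str) -> str:
    """Prepend a copy of the root's initial consonant(s) plus first vowel."""
    for i, ch in enumerate(root.lower()):
        if ch in VOWELS:
            return root[: i + 1] + root
    return root  # no vowel found: leave the root unchanged


def full_reduplicate(root: str) -> str:
    """Copy the entire root, joined with a hyphen as in araw-araw."""
    return f"{root}-{root}"


for root in ["luto", "takbo"]:
    print(f"{root} -> {partial_reduplicate(root)}")
print(f"araw -> {full_reduplicate('araw')}")
```

Because both functions apply to any root, they produce well\-formed candidates even for roots whose reduplicated forms never appear in a corpus, which is the productivity property the paragraph above describes\.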

#### Morphophonemic alternations at affix boundaries\.

Affixation in Filipino often triggers alternations that obscure a simple affix\-root boundary \(schachter1983tagalog\)\. A canonical case is nasal assimilation in the mang\- prefix, where the nasal assimilates to the place of articulation of the following consonant and may induce deletion or substitution \(e\.g\., mang\- \+ bili → mamili\)\. More generally, attachment can involve phonologically conditioned changes that make decomposition a structured reasoning problem rather than a simple string\-splitting task\.
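
A deliberately simplified sketch of the mang\- alternation is given below\. The place\-of\-articulation table and the function name mang\_prefix are assumptions for illustration only\. The sketch always replaces the root\-initial obstruent with the assimilated nasal, whereas in actual usage some roots retain the consonant, so it should be read as a picture of why decomposition is not plain string splitting rather than as a complete rule\.

```python
# Assumed mapping from root-initial consonant to the assimilated nasal.
# Labials -> m, coronals -> n, velar k -> ng. Simplified on purpose.
ASSIMILATED_NASAL = {"p": "m", "b": "m", "t": "n", "d": "n", "s": "n", "k": "ng"}


def mang_prefix(root: str) -> str:
    """Attach mang- with nasal substitution (simplified, exception-free)."""
    first = root[0].lower()
    if first in ASSIMILATED_NASAL:
        # The nasal takes the consonant's place and the consonant is dropped.
        return "ma" + ASSIMILATED_NASAL[first] + root[1:]
    return "mang" + root  # elsewhere the prefix attaches unchanged


print(mang_prefix("bili"))  # example from the text: mang- + bili
```

Note that the surface form mamili no longer contains the root\-initial b, so a decomposer has to undo the substitution to recover bili; stripping a fixed prefix string does not work\.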

#### Stress, glottal stop, and diacritic omission\.

Filipino orthography can mark stress and glottal stop using diacritics \(acute, grave, circumflex\), yielding minimal pairs such as bása "read" vs\. basâ "wet", súka "vomit" vs\. sukà "vinegar", and táyo "we" vs\. tayô "stand" \(yap1967synchronic\)\. However, these diacritics are rarely written in contemporary digital text \(almario2014masinop\)\. For modeling, this creates a systematic mismatch between \(i\) lexicographic forms where distinctions are explicit and \(ii\) naturally occurring forms where distinctions are collapsed\. As a result, tasks that require sensitivity to stress and glottal structure can fail either because the model lacks the concept or because the surface form underdetermines the target\.
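
The collapse of these minimal pairs under diacritic omission can be reproduced with Unicode normalization\. The sketch below uses only the Python standard library; the helper name strip\_diacritics is an assumption for this example, and the word pairs are the ones quoted above\.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove combining marks (acute, grave, circumflex) from a word."""
    decomposed = unicodedata.normalize("NFD", word)  # split base letter + mark
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


PAIRS = [("bása", "basâ"), ("súka", "sukà"), ("táyo", "tayô")]

# Group the lexicographic forms by the spelling found in everyday text.
collapsed = defaultdict(list)
for pair in PAIRS:
    for form in pair:
        collapsed[strip_diacritics(form)].append(form)

for plain, forms in collapsed.items():
    print(f"{plain}: {forms}")
```

Each plain spelling maps back to two distinct lexicographic forms, which is the sense in which the surface form underdetermines the target\.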

#### Implications for evaluation design\.

Taken together, infixation, reduplication, alternations, and diacritic omission create a setting where surface\-form competence is not sufficient for morphological competence\. A model may \(i\) recognize characters but fail to segment into morphemes, \(ii\) memorize frequent affixed words without learning productive rules, or \(iii\) handle concatenative affixes but fail on infixes because the relevant unit is non\-contiguous\. PACUTE is designed to disentangle these possibilities by pairing \(a\) character\-level controls, \(b\) targeted affixation and syllabification tasks, and \(c\) a hierarchical framework that localizes errors to a specific compositional level\.

## Appendix B Prompting and Instance Structure

MCQ instances score options by model log\-probability under a fixed prompt and require no special output format\. GEN instances use a structured XML output format: the model produces a <reflection\> block \(1–2 sentences of reasoning\) followed by an <answer\> block containing only the target form\. The evaluator extracts the content inside <answer\> and discards the reflection before computing exact\-match and contains\-match scores\. For instruction\-tuned models, the format instruction is injected as the system prompt via the tokenizer's chat template; for base models, it is prepended as plain text\. Thinking\-mode models additionally produce a <think\> block prior to the reflection, which is similarly stripped\.
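
A minimal sketch of the GEN scoring path described above is shown below\. It is not the PACUTE evaluator: the function names, the lowercase and whitespace normalization, and the reading of contains\-match as "the reference string occurs inside the extracted answer" are all assumptions made for illustration\.

```python
import re
from typing import Dict, Optional

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def extract_answer(output: str) -> Optional[str]:
    """Drop any <think> block, then return the text inside <answer>."""
    output = THINK_RE.sub("", output)
    match = ANSWER_RE.search(output)
    return match.group(1).strip() if match else None


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def score(output: str, reference: str) -> Dict[str, bool]:
    """Return exact-match and contains-match flags for one GEN instance."""
    answer = extract_answer(output)
    if answer is None:  # malformed output: no answer block to score
        return {"exact": False, "contains": False}
    ans, ref = normalize(answer), normalize(reference)
    return {"exact": ans == ref, "contains": ref in ans}


raw = (
    "<think>scratch work</think>"
    "<reflection>The infix follows the initial consonant.</reflection>"
    "<answer> kumain </answer>"
)
print(score(raw, "kumain"))
print(score("<answer>The form is kumain</answer>", "kumain"))
```

In this sketch an output with no answer block scores zero on both metrics, and an answer that wraps the target in extra words passes contains\-match but fails exact\-match\.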

The reflection requirement serves two purposes: \(i\) it provides a lightweight, always\-on reasoning trace enabling the qualitative error analysis in §[7](https://arxiv.org/html/2606.15144#S7); and \(ii\) it encourages the model to commit to an explicit rationale before producing its answer, which is informative for diagnosing Type III errors \(reasoning inconsistency\)\. Prompts are intentionally simple and uniform—no few\-shot examples, no chain\-of\-thought elicitation beyond the reflection format—to preserve comparability across base and instruction\-tuned models\.

## Appendix C Model List

Tables [2](https://arxiv.org/html/2606.15144#A3.T2) to [4](https://arxiv.org/html/2606.15144#A3.T4) list all models evaluated in this work with parameter counts, variant types, and references\. IT = instruction\-tuned; PT = pretrained \(base\); Thinking = extended chain\-of\-thought mode\.

Table 2: Base pre\-trained models\. Model names link to HuggingFace model pages\.

Table 3: Instruction\-tuned and reasoning variants\. IT = standard instruction\-tuned only; IT \+ Thinking = evaluated in both standard and thinking modes; Thinking = thinking\-only model\. Model names link to HuggingFace model pages\.

| Family | Model | Type | Params | Reference |
|---|---|---|---|---|
| Gemma\-3 | [google/gemma\-3\-4b\-it](https://huggingface.co/google/gemma-3-4b-it) | IT | 4B | google2025gemma3 |
| | [google/gemma\-3\-12b\-it](https://huggingface.co/google/gemma-3-12b-it) | IT | 12B | |
| | [google/gemma\-3\-27b\-it](https://huggingface.co/google/gemma-3-27b-it) | IT | 27B | |
| Gemma\-4 | [google/gemma\-4\-26B\-A4B\-it](https://huggingface.co/google/gemma-4-26B-A4B-it) | IT \+ Thinking | 26B | google2026gemma4 |
| | [google/gemma\-4\-31B\-it](https://huggingface.co/google/gemma-4-31B-it) | IT \+ Thinking | 31B | |
| | [google/gemma\-4\-E2B\-it](https://huggingface.co/google/gemma-4-E2B-it) | IT \+ Thinking | 5B | |
| | [google/gemma\-4\-E4B\-it](https://huggingface.co/google/gemma-4-E4B-it) | IT \+ Thinking | 8B | |
| GPT\-OSS | [openai/gpt\-oss\-20b](https://huggingface.co/openai/gpt-oss-20b) | Thinking | 20B | openai2025gptoss120bgptoss20bmodel |
| | [openai/gpt\-oss\-120b](https://huggingface.co/openai/gpt-oss-120b) | Thinking | 120B | |
| Mistral\-Small\-4 | [mistralai/Mistral\-Small\-4\-119B\-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) | IT \+ Thinking | 119B | mistral2026mistralsmall |
| Phi\-4 | [microsoft/Phi\-4\-mini\-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | IT | 3\.8B | microsoft2025phi4minitechnicalreportcompact |
| | [microsoft/Phi\-4](https://huggingface.co/microsoft/phi-4) | IT | 14B | abdin2024phi4technicalreport |
| Qwen\-3 | [Qwen/Qwen3\-4B](https://huggingface.co/Qwen/Qwen3-4B) | IT \+ Thinking | 4B | yang2025qwen3technicalreport |
| | [Qwen/Qwen3\-8B](https://huggingface.co/Qwen/Qwen3-8B) | IT \+ Thinking | 8B | |
| | [Qwen/Qwen3\-14B](https://huggingface.co/Qwen/Qwen3-14B) | IT \+ Thinking | 14B | |
| | [Qwen/Qwen3\-32B](https://huggingface.co/Qwen/Qwen3-32B) | IT \+ Thinking | 32B | |
| Qwen\-3\.6 | [Qwen/Qwen3\.6\-27B](https://huggingface.co/Qwen/Qwen3.6-27B) | IT \+ Thinking | 0\.6B | qwen2026qwen36 |
| SEA\-LION\-4 | [aisingapore/Gemma\-SEA\-LION\-v4\.5\-E2B\-IT](https://huggingface.co/aisingapore/Gemma-SEA-LION-v4.5-W2B-IT) | IT \+ Thinking | 5B | sealion2026sealion45 |
| | [aisingapore/Qwen\-SEA\-LION\-v4\.5\-27B\-IT](https://huggingface.co/aisingapore/Qwen-SEA-LION-v4.5-27B-IT) | IT \+ Thinking | 27B | |

Table 4: Commercial frontier models\.

| Provider | Model | Params | Reference |
|---|---|---|---|
| Anthropic | Opus 4\.6 | \- | anthropic2026claudeopus46 |
| | Sonnet 4\.6 | \- | anthropic2026claudesonnet46 |
| | Haiku 4\.5 | \- | anthropic2026claudehaiku45 |
| DeepSeek | DeepSeek R1 | 685B | deepseekai2025deepseekr1 |
| | DeepSeek 3\.2 | 685B | deepseekai2025deepseekv32 |
| Google | Gemini\-3\.5\-Flash | \- | google2026gemini35flash |
| | Gemini\-3\.1\-Flash\-Lite | \- | google2026gemini31flashlite |
| Moonshot AI | Kimi K2 | 1\.1T | moonshot2025kimik2 |
| OpenAI | GPT\-5\.5 | \- | openai2026gpt55 |
| | GPT\-5\.4\-Mini | \- | openai2026gpt54mini |
| | GPT\-5\.4\-Nano | \- | openai2026gpt54nano |
| | GPT\-5\-Mini | \- | openai2025gpt5mini |

## Appendix D Evaluation Results

Table 5: Human baseline performance and inter\-annotator agreement on the MCQ format of PACUTE \(three annotators\)\. Mean Acc is accuracy averaged across annotators \(± std\)\. Raw Agree is the fraction of items on which all three annotators selected the same option\. Fleiss κ is computed over the four\-class option labels \(κ ≥ 0\.81 almost perfect; 0\.61–0\.80 substantial\)\. MCQ Avg is the macro\-average across all five benchmarks\.

Table 6: Human baseline performance and inter\-annotator agreement on the GEN format of PACUTE \(three annotators\)\. Mean EM is exact\-match accuracy averaged across annotators \(± std\)\. Raw Agree is the fraction of items on which all three annotators produced identical normalized strings\. Fleiss κ is computed on the binarized correct/incorrect judgment against the reference answer \(κ ≥ 0\.81 almost perfect; 0\.61–0\.80 substantial\)\. GEN Avg is the macro\-average across all five benchmarks\.

Table 7: Results for PT models on PACUTE benchmarks\. All values are percentages\. Comp = PACUTE Composition; Manip = PACUTE Manipulation; MExt = PACUTE Morphological Extraction; MProd = PACUTE Morphological Production; Syll = PACUTE Syllabification; Hier = Hierarchical diagnostic; LGame = LangGame \(language\-agnostic control\); MDA = Multi\-Digit Addition \(catastrophic\-forgetting probe\)\. Bold = best in column\. Cell color: low → white → high per column\.
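
For reference, the two agreement statistics named in the human\-baseline captions \(Tables 5 and 6\) can be computed as in the sketch below\. It follows the standard definition of Fleiss' κ for a fixed number of raters per item; the function names and the toy ratings are hypothetical and are not PACUTE annotation data\.

```python
from collections import Counter
from typing import List, Sequence


def raw_agreement(ratings: Sequence[Sequence[str]]) -> float:
    """Fraction of items on which every rater gave the same label."""
    return sum(len(set(item)) == 1 for item in ratings) / len(ratings)


def fleiss_kappa(ratings: Sequence[Sequence[str]], categories: List[str]) -> float:
    """Fleiss' kappa for items rated by the same number of raters."""
    n_items, n_raters = len(ratings), len(ratings[0])
    counts = [Counter(item) for item in ratings]
    # Observed agreement: mean over items of the pairwise agreement rate.
    p_items = [
        (sum(c[k] ** 2 for k in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall label distribution.
    p_cat = [sum(c[k] for c in counts) / (n_items * n_raters) for k in categories]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:  # every rating is the same label: kappa is undefined
        return 1.0
    return (p_bar - p_exp) / (1.0 - p_exp)


# Toy example: three annotators, four MCQ options, four items.
toy = [["A", "A", "A"], ["B", "B", "B"], ["C", "C", "D"], ["D", "D", "D"]]
print(raw_agreement(toy), fleiss_kappa(toy, ["A", "B", "C", "D"]))
```

For the GEN format the same function applies with the two categories correct and incorrect, matching the binarized judgment described in the Table 6 caption\.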

Table 8: Results for IT models on PACUTE MCQ benchmarks\. All values are percentages\. Comp = PACUTE Composition; Manip = PACUTE Manipulation; MExt = PACUTE Morphological Extraction; MProd = PACUTE Morphological Production; Syll = PACUTE Syllabification; Hier = Hierarchical diagnostic; LGame = LangGame \(language\-agnostic control\); MDA = Multi\-Digit Addition \(catastrophic\-forgetting probe\)\. Bold = best in column\. Cell color: low → white → high per column\.

Table 9: Results for IT models on PACUTE GEN benchmarks\. EM = exact match; CM = contains match\. All values are percentages\. Comp = PACUTE Composition; Manip = PACUTE Manipulation; MExt = PACUTE Morphological Extraction; MProd = PACUTE Morphological Production; Syll = PACUTE Syllabification; Hier = Hierarchical diagnostic; LGame = LangGame \(language\-agnostic control\); MDA = Multi\-Digit Addition \(catastrophic\-forgetting probe\); CUTE = character\-level understanding control\. Bold = best in column\. Cell color: low → white → high per column\.

Table 10: Results for commercial models on PACUTE GEN benchmarks\. EM = exact match; CM = contains match\. Commercial APIs do not expose token log\-probabilities, so MCQ benchmarks are omitted\. All values are percentages\. Comp = PACUTE Composition; Manip = PACUTE Manipulation; MExt = PACUTE Morphological Extraction; MProd = PACUTE Morphological Production; Syll = PACUTE Syllabification; Hier = Hierarchical diagnostic; LGame = LangGame \(language\-agnostic control\); MDA = Multi\-Digit Addition \(catastrophic\-forgetting probe\); CUTE = character\-level understanding control\. Bold = best in column\. Cell color: low → white → high per column\.

## Appendix E Continued Pre\-training Results

This appendix reports results from continued pre\-training of Gemma\-2\-2B on the SEA\-PILE v2 Filipino corpus \(∼7\.4 GB\) across three tokenization regimes: Vanilla \(standard BPE\), StochasTok \(∼10% stochastic token expansion\), and Patok \(morphology\-aware expand/contract\)\. MCQ chance = 25% \(4\-way\)\. All CPT checkpoints lose math capability entirely \(catastrophic forgetting\), confirming the multi\-digit addition benchmark as a reliable domain\-adaptation health check\.

Table 11: MCQ benchmark accuracy for Gemma\-2\-2B and its three CPT regimes on SEA\-PILE v2 \(5,000 steps\)\. All values are raw 4\-way accuracy percentages \(chance = 25%\)\. Bold = best per column\. CPT regimes broadly underperform the un\-CPT'd base on character\-level tasks and the LangGame/MDA controls, with MDA in particular dropping from 68\.7% to 14–18% \(catastrophic forgetting on arithmetic\)\. Comp = PACUTE Composition; Manip = PACUTE Manipulation; MExt = PACUTE Morphological Extraction; MProd = PACUTE Morphological Production; Syll = PACUTE Syllabification; PACUTE MCQ Avg\. is the macro\-average of the PACUTE MCQ main suite; Hier = Hierarchical diagnostic; LGame = LangGame \(language\-agnostic control\); MDA = Multi\-Digit Addition \(catastrophic\-forgetting probe\)\.

## Appendix F Error Analysis: Annotated Examples

Tables [12](https://arxiv.org/html/2606.15144#A6.T12)–[15](https://arxiv.org/html/2606.15144#A6.T15) provide representative model outputs for each of the four error types described in §[7](https://arxiv.org/html/2606.15144#S7)\. All examples are drawn from GEN inference outputs of instruction\-tuned models\. Task inputs are quoted verbatim; model outputs are lightly formatted for readability \(angle\-bracket tags are from the model's structured output format\)\.

Table 12: Type I: Instruction\-following failure\. The model produces a malformed output: here, a character\-insertion task yields garbled output with spurious spaces rather than the target string\.

Table 13: Type II: Linguistic error\. The model correctly identifies that \-in\- is an infix and that it should be placed inside the root, but inserts it at the wrong position \(takinbo vs\. the correct tinakbo, where the infix follows the initial consonant\)\.

Table 14: Type III: Reasoning inconsistency\. The model's chain\-of\-thought correctly derives kinain \("k \+ in \+ ain"\) but then dismisses this intermediate result and outputs kalinin as the final answer\.

Table 15: Type IV: Systematic penultimate\-stress bias\. The model explicitly asserts that Filipino stress defaults to the penultimate syllable and selects ka accordingly\. In the given sentence, kaya means "which is why" \(kayà\), which carries final stress on ya\.
