When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

arXiv cs.CL Papers

Summary

This paper investigates how character-level transformer models generalize to irregular verb subtypes in Japanese past-tense inflection. Controlled experiments show that including irregular examples can improve generalization, challenging the assumption that regularity simplifies learning.

arXiv:2605.20558v1 Announce Type: new Abstract: Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (<1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.

Cached at: 05/21/26, 06:33 AM

# A Subclass Analysis of Inductive Bias in Neural Morphology
Source: https://arxiv.org/html/2605.20558
Type 4 \(Other irregular verbs\): Verbs deviating from standard Godan or Ichidan patterns\. Subtypes capture finer\-grained orthographic variation:

- Type 4\-1: Stem\-final /i/ \+ gemination\. These verbs resemble Type 2 but involve gemination at the boundary between the stem\-final /i/ and the past\-tense suffix \-ta\. Example: まじる majiru ‘to mix’ → まじった majitta ‘mixed’\. Count: 119

- Type 4\-2: Stem\-final /e/ \+ gemination\. Gemination occurs at the boundary between stem\-final /e/ and the past\-tense suffix \-ta, increasing structural complexity\. Example: あきれかえる akirekaeru ‘to be shocked’ → あきれかえった akirekaetta ‘was shocked’\. Count: 37

- Type 4\-3: Localized deviations\. These verbs largely follow Type 1 formation but include idiosyncratic stem behavior\. Due to the one\-to\-one lemma–form constraint, only one instance appears\. Example: いく iku ‘to go’ → いった itta ‘went’\. Count: 1
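The subtype definitions above can be read as a surface rule on the romanized forms. The sketch below is a hypothetical heuristic for illustration, not the authors' annotation procedure: the function name `classify_type4`, the use of romaji input, and the suffix tests are all assumptions.

```python
from typing import Optional


def classify_type4(lemma: str, past: str) -> Optional[str]:
    """Guess the Type 4 subtype from a romanized lemma and past-tense form.

    Hypothetical heuristic: it only inspects surface suffixes and does not
    reproduce the paper's annotation. Returns None when no subtype matches.
    """
    # Gemination before the past-tense suffix -ta surfaces as "tta" in romaji.
    if not past.endswith("tta"):
        return None
    # Type 4-3: the single localized deviation listed in the text (iku -> itta).
    if lemma == "iku" and past == "itta":
        return "4-3"
    # Types 4-1 / 4-2: the lemma looks Ichidan-like (-iru / -eru) but the
    # final -ru is replaced by geminated -tta instead of plain -ta.
    geminated = lemma[:-2] + "tta"
    if lemma.endswith("iru") and past == geminated:
        return "4-1"
    if lemma.endswith("eru") and past == geminated:
        return "4-2"
    return None


if __name__ == "__main__":
    # Examples taken from the subtype descriptions above.
    for lemma, past in [("majiru", "majitta"),
                        ("akirekaeru", "akirekaetta"),
                        ("iku", "itta")]:
        print(lemma, past, classify_type4(lemma, past))
```

A rule this simple would need a verified lexicon behind it in practice, since it assumes the romanization is consistent and the past form is already known.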

Table 1 (https://arxiv.org/html/2605.20558#S2.T1) summarizes the dataset distribution across verb types\.

| Verb Type | Count | Proportion \(%\) |
| --- | --- | --- |
| All verbs | 3,958 | 100 |
| Type 1 \(Godan\) | 2,503 | 63\.2 |
| Type 2 \(Ichidan\) | 1,298 | 32\.8 |
| Type 3 \(Canonical irregular\) | 0 | 0 |
| Type 4 \(Other irregular\) | 157 | 4\.0 |
| 4\-1 \(/i/ \+ gemination\) | 119 | 3\.0 |
| 4\-2 \(/e/ \+ gemination\) | 37 | 0\.9 |
| 4\-3 \(localized\) | 1 | 0\.02 |

Table 1: Dataset statistics by verb type\. Subtypes of Type 4 are listed in parentheses for brevity\.

## 3\. Models

We evaluate two character\-level transformer encoder–decoder models for Japanese past\-tense inflection\. The first follows the SIGMORPHON 2020 baseline \(Vylomova et al\., 2020b (https://arxiv.org/html/2605.20558#bib.bib15)\), and the second is based on the lemma\-split evaluation from SIGMORPHON–UniMorph 2023 \(Goldman et al\., 2023 (https://arxiv.org/html/2605.20558#bib.bib5)\), which prevents lemmas from appearing in both training and test sets\. Both models operate over hiragana strings, capturing consonant gemination, vowel alternation, and other orthographic phenomena, making them suitable for analyzing rare structural irregularities in the dataset\.

## 4\. Experimental Setup

### 4\.1\. Training Regime

Training for both models follows the default hyperparameter configurations provided in their respective shared task baselines \(Vylomova et al\., 2020b (https://arxiv.org/html/2605.20558#bib.bib15); Goldman et al\., 2023 (https://arxiv.org/html/2605.20558#bib.bib5)\)\. Models are trained using cross\-entropy loss with teacher forcing\. Optimization employs the Adam algorithm \(Kingma and Ba, 2015 (https://arxiv.org/html/2605.20558#bib.bib26)\) with standard transformer learning rate scheduling \(Vaswani et al\., 2017 (https://arxiv.org/html/2605.20558#bib.bib14)\)\. Random seeds are fixed to ensure reproducibility across experiments\.
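The learning rate schedule cited from Vaswani et al. (2017) is a linear warmup followed by inverse-square-root decay. A minimal sketch follows; the function name `transformer_lr` and the values `d_model=512` and `warmup_steps=4000` are illustrative assumptions, not the settings used by the shared-task baselines.

```python
def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given optimizer step (step >= 1).

    Implements lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).
    The default values are illustrative assumptions, not the paper's settings.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # The rate rises linearly during warmup, peaks at warmup_steps, then decays.
    for s in (1, 1000, 4000, 16000):
        print(s, transformer_lr(s))
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the schedule switches from warmup to decay.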

### 4\.2\. Controlled Dataset Conditions

To systematically evaluate the impact of irregular subtypes on model generalization, we conduct controlled experiments with curated subsets of the dataset:

- Full Dataset \(Types 1–4\): Includes all regular and irregular verbs, representing the natural distribution of structural subgroups\.
- Regular Only \(Types 1–2\): Excludes all irregular verbs, removing orthographically complex minority subgroups\.
- Regular \+ Individual Irregular Subtypes: Types 1–2 plus one of 4\-1, 4\-2, or 4\-3, isolating the contribution of each irregular subtype\.
- Regular \+ Subtype Combinations: Selected combinations of 4\-1, 4\-2, and 4\-3 are added to assess interaction effects among minority subgroups\.

#### Evaluation Alignment\.

For each ablation, the same verb types are removed from training and test sets, ensuring that observed differences reflect the effects of structural subgroups rather than dataset volume\.
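The aligned filtering described above can be sketched as a single type filter applied identically to the training and test splits. This is an assumed implementation for illustration: the condition names, the `Example` tuple layout, and the helper names `apply_condition` and `build_split` are hypothetical, and the two regular verbs in the toy data are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Example = Tuple[str, str, str]  # (lemma, past form, verb type label)

# Hypothetical condition names mapped to the verb types each one retains.
CONDITIONS: Dict[str, Set[str]] = {
    "full": {"1", "2", "4-1", "4-2", "4-3"},
    "regular_only": {"1", "2"},
    "regular_plus_4-1": {"1", "2", "4-1"},
    "regular_plus_4-2": {"1", "2", "4-2"},
    "regular_plus_4-3": {"1", "2", "4-3"},
    "regular_plus_4-1_4-3": {"1", "2", "4-1", "4-3"},  # i.e. Type 4-2 removed
}


def apply_condition(examples: List[Example], keep: Set[str]) -> List[Example]:
    """Keep only the examples whose verb type is in `keep`."""
    return [ex for ex in examples if ex[2] in keep]


def build_split(train: List[Example], test: List[Example],
                condition: str) -> Tuple[List[Example], List[Example]]:
    """Apply the same type filter to both splits so they stay aligned."""
    keep = CONDITIONS[condition]
    return apply_condition(train, keep), apply_condition(test, keep)


if __name__ == "__main__":
    toy = [
        ("kaku", "kaita", "1"),             # illustrative regular verb
        ("taberu", "tabeta", "2"),          # illustrative regular verb
        ("majiru", "majitta", "4-1"),
        ("akirekaeru", "akirekaetta", "4-2"),
        ("iku", "itta", "4-3"),
    ]
    train, test = build_split(toy, toy, "regular_plus_4-1_4-3")
    print([ex[2] for ex in train])
```

Keeping the filter in one place makes it hard for the training and test splits to drift apart across conditions.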

### 4\.3\. Evaluation Metrics

Evaluation follows established SIGMORPHON conventions, supplemented by subgroup\-level diagnostics to address the limitations of aggregate reporting\. Our metrics include:

- Exact\-Match Accuracy: The percentage of predicted forms identical to gold targets \(Cotterell et al\., 2017 (https://arxiv.org/html/2605.20558#bib.bib4); Goldman et al\., 2023 (https://arxiv.org/html/2605.20558#bib.bib5)\)\.
- Subgroup Accuracy: Accuracy computed separately for each verb type to reveal concentrated errors in structural minority subgroups, such as rare irregulars, a technique frequently used to diagnose systematic morphological failures \(Kann and Schütze, 2016 (https://arxiv.org/html/2605.20558#bib.bib7); Makarov and Clematide, 2018 (https://arxiv.org/html/2605.20558#bib.bib9); Vylomova et al\., 2020b (https://arxiv.org/html/2605.20558#bib.bib15)\)\.
- Disparity Ratio: To quantify the disproportionate concentration of errors across subgroups, we define the *Disparity Ratio* for a subgroup $g$ as:

  $$\text{Disparity Ratio}_g = \frac{\text{Error Share}_g}{\text{Data Share}_g}$$

  Here $\text{Error Share}_g$ is the fraction of all errors that fall in subgroup $g$, and $\text{Data Share}_g$ is the fraction of the data belonging to $g$\. This ratio measures the relative error burden of a subgroup; a value greater than 1 signifies that the subgroup suffers from disproportionately high error rates compared to its prevalence in the data\. Similar subgroup\-specific performance disparities have been explored in identity\-aware AI and general predictive modeling \(Buolamwini and Gebru, 2018 (https://arxiv.org/html/2605.20558#bib.bib3); Sagawa et al\., 2020 (https://arxiv.org/html/2605.20558#bib.bib13); Blodgett et al\., 2020 (https://arxiv.org/html/2605.20558#bib.bib2)\)\.

This evaluation strategy adopts fairness diagnostics from the broader field of NLP, illustrating why aggregate metrics often mask localized weaknesses in linguistic structure \(Blodgett et al\., 2020 (https://arxiv.org/html/2605.20558#bib.bib2); Lake and Baroni, 2018 (https://arxiv.org/html/2605.20558#bib.bib8); Marcus, 2018 (https://arxiv.org/html/2605.20558#bib.bib10)\)\.
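The three metrics in this section can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code: the `Record` layout, the function names, and the toy data are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (verb type, gold form, predicted form)


def exact_match_accuracy(records: List[Record]) -> float:
    """Percentage of predictions identical to the gold form."""
    if not records:
        return 0.0
    correct = sum(1 for _, gold, pred in records if gold == pred)
    return 100.0 * correct / len(records)


def subgroup_accuracy(records: List[Record]) -> Dict[str, float]:
    """Exact-match accuracy computed separately for each verb type."""
    totals: Counter = Counter()
    correct: Counter = Counter()
    for group, gold, pred in records:
        totals[group] += 1
        correct[group] += int(gold == pred)
    return {g: 100.0 * correct[g] / totals[g] for g in totals}


def disparity_ratios(records: List[Record]) -> Dict[str, float]:
    """Error share divided by data share for each verb type."""
    totals = Counter(g for g, _, _ in records)
    errors = Counter(g for g, gold, pred in records if gold != pred)
    n_errors = sum(errors.values())
    if n_errors == 0:
        return {g: 0.0 for g in totals}  # no errors, so no concentration
    n_total = len(records)
    return {g: (errors[g] / n_errors) / (totals[g] / n_total) for g in totals}


if __name__ == "__main__":
    # Toy data (hypothetical): 8 items of type "1" with one error,
    # 2 items of type "4-2" with one error.
    toy = [("1", "a", "a")] * 7 + [("1", "a", "b")]
    toy += [("4-2", "c", "c"), ("4-2", "c", "d")]
    print(exact_match_accuracy(toy))
    print(subgroup_accuracy(toy))
    print(disparity_ratios(toy))
```

In the toy data the minority type holds 20% of the items but 50% of the errors, so its ratio is above 1 while the majority type's ratio is below 1.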

## 5\. Results

### 5\.1\. Baseline Performance

Under full training conditions, both systems achieve high aggregate accuracy on Japanese past\-tense inflection:

- SIGMORPHON 2020: 97\.98%
- SIGMORPHON 2023: 97\.73%

Despite high aggregate accuracy, errors are concentrated in specific low\-frequency subclasses\.

### 5\.2\. Subtype\-Specific Ablation Effects

To assess the contribution of individual irregular subclasses, we remove each Type 4 subgroup independently from both training and evaluation data\. Removing Type 4\-2 yields the largest performance gains:

- 2020: 97\.98% → 99\.98% \(\+2\.00\)
- 2023: 97\.73% → 99\.75% \(\+2\.02\)

This corresponds to an error reduction of approximately 99% relative to baseline error mass for the 2020 system and 88% for the 2023 system\. \(Footnote 1: Error reduction is computed relative to the baseline error, $100 - \text{accuracy}$\.\)
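The relative error reduction can be recomputed from the accuracies listed above. The sketch below is an assumed reading of the footnote's definition, with a hypothetical function name. Using the rounded accuracies it gives roughly 99% for the 2020 system and close to 89% for the 2023 system; small differences from the figures in the text would be expected if those were computed from unrounded values.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Percentage of the baseline error (100 - accuracy) that is removed."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    if err_before <= 0:
        raise ValueError("baseline has no error to reduce")
    return 100.0 * (err_before - err_after) / err_before


if __name__ == "__main__":
    # Accuracies before and after removing Type 4-2, as listed in the text.
    for system, before, after in [("2020", 97.98, 99.98), ("2023", 97.73, 99.75)]:
        print(system, round(relative_error_reduction(before, after), 1))
```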

In contrast, removing other irregular subclasses produces substantially smaller improvements\. Eliminating all irregular verbs \(Type 4\) does not produce maximal accuracy, indicating that performance gains are not driven uniformly by irregularity removal\.

### 5\.3\. Empirical Distribution of Errors

Table 2: Observed errors under full training conditions for both SIGMORPHON 2020 and 2023 models\. Error counts highlight the concentration of failures in low\-frequency irregular subtypes\.

Table 2 (https://arxiv.org/html/2605.20558#S5.T2) summarizes the distribution of observed errors under full training conditions for both SIGMORPHON 2020 and 2023 models\. Presenting both models side by side highlights the consistency of error patterns across architectures\.

Across both architectures, Type 4\-2 verbs account for a disproportionately high share of gemination\-related failures relative to their dataset frequency\. Most errors involve omission of the small っ tsu \(e\.g\., あまがけった amagaketta → あまがけた amagaketa\) or spurious insertion \(できた dekita → できった dekitta\)\. These concentrated failures persist across ablation variants that retain Type 4\-2, while removing Type 4\-2 sharply reduces total errors\. In contrast, Type 4\-1 and 4\-3 contribute comparatively few errors relative to their dataset share\. This asymmetry demonstrates that structural difficulty is driven not by irregular status alone but by the interaction of frequency, phonological conditioning, and orthographic realization\.

### 5\.4\. Error Concentration Across Subclasses

We next examine the distribution of errors by verb type under full training\. Although Type 4\-2 constitutes only 0\.9% of the dataset, it accounts for 15\.8% of total errors in the 2020 system\.

Disparity Ratios \(see Section 4\.3\) indicate that Type 4\-2 contributes over seventeen times its proportional representation \(17\.56×\) to total errors\. By comparison:

- Type 1: Ratio 0\.80
- Type 2: Ratio 0\.50

Majority subclasses therefore generate fewer errors than their share of the data would predict \(ratios below 1\), whereas Type 4\-2 generates substantially more\. The 2023 system shows the same pattern, suggesting that the concentration of failures in Type 4\-2 is not specific to one architecture\.
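As a quick check of the 17.56× figure for the 2020 system, the rounded shares quoted in this section can be substituted into the Section 4.3 definition:

$$\text{Disparity Ratio}_{4\text{-}2} = \frac{\text{Error Share}_{4\text{-}2}}{\text{Data Share}_{4\text{-}2}} = \frac{15.8}{0.9} \approx 17.56$$

The inputs are the rounded percentages from the text, so the result is only as precise as those figures.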

### 5\.5\. Qualitative Error Patterns

Across both models, the dominant failure pattern in Type 4\-2 involves incorrect handling of consonant gemination in past\-tense formation\. Errors include:

- Omission of required gemination \(e\.g\., ねがえった negaetta → ねがえた negaeta\)
- Spurious gemination \(e\.g\., できた dekita → できった dekitta\)

These patterns account for the majority of errors within the subclass, suggesting that the instability is structurally localized rather than random\.

## 6\. Error Analysis

Beyond quantitative accuracy metrics, we conducted fine\-grained error analysis across all experimental conditions\. We manually examined residual prediction errors from both the 2020 and 2023 models under full and ablated training regimes\. Errors were categorized into gemination errors, stem alternation errors, morpheme boundary errors, over\-regularization, and vowel length errors\. As shown in Table 2 (https://arxiv.org/html/2605.20558#S5.T2), Type 4\-2 accounts for a disproportionately high share of errors\.

### 6\.1\. Error Taxonomy

Table 3 (https://arxiv.org/html/2605.20558#S6.SS1) presents the error taxonomy, highlighting the orthographic phenomena underlying each error type and their dominant structural sources\.

Table 3: Error taxonomy with dominant structural sources and orthographic properties\.

### 6\.2\. Gemination and Stem Alternation Errors

Gemination errors are the most frequent and structurally revealing\. They arise from omission or spurious insertion of the small っ tsu character, particularly in verbs where gemination interacts with stem\-final /e/ vowels\. Under full training, Type 4\-2 verbs account for the majority of gemination\-related failures despite constituting less than 1% of the dataset\. This concentration is visible in both 2020 and 2023 systems, and persists under multiple ablation regimes unless Type 4\-2 itself is removed\.

Stem alternation errors occur when the model fails to apply expected vowel\-conditioned alternations \(e\.g\., あきれかえる akirekaeru → あきれかえう akirekaeu\)\. These errors are primarily associated with irregular subclasses 4\-2 and 4\-3, indicating that alternation and gemination interact in destabilizing ways for the model\.
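The two gemination failure modes described in Section 6.2 can be told apart automatically by comparing gold and predicted hiragana strings. The sketch below is a hypothetical helper, not the manual procedure the analysis reports: the function name and the label strings are assumptions.

```python
SOKUON = "っ"  # small tsu, which marks consonant gemination in hiragana


def gemination_error_type(gold: str, pred: str) -> str:
    """Label a prediction as correct, omission, spurious, or other.

    An error counts as a pure gemination error only when gold and
    prediction become identical once every small tsu is removed.
    """
    if gold == pred:
        return "correct"
    if gold.replace(SOKUON, "") != pred.replace(SOKUON, ""):
        return "other"  # the strings differ in more than gemination marks
    gold_n, pred_n = gold.count(SOKUON), pred.count(SOKUON)
    if pred_n < gold_n:
        return "omission"  # a required small tsu is missing
    if pred_n > gold_n:
        return "spurious"  # an extra small tsu was inserted
    return "other"  # same number of small tsu, but in different positions


if __name__ == "__main__":
    # Gold/prediction pairs taken from the error examples in the text.
    print(gemination_error_type("ねがえった", "ねがえた"))
    print(gemination_error_type("できた", "できった"))
```

A filter like this could pre-sort errors before manual review, leaving only the "other" bucket for closer inspection.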
### 6\.3\. Effects of Selective Ablation on Error Patterns

Ablation results further clarify this pattern\. When all irregular verbs are removed, overall error count decreases moderately\. However, when only Type 4\-2 is removed, error count decreases more substantially, and gemination\-related failures nearly disappear\. By contrast, removing 4\-1 or 4\-3 while retaining 4\-2 does not eliminate concentrated gemination errors\. This asymmetry indicates that structural complexity is not uniformly distributed across irregular subclasses\.

## 7\. Discussion

Our analysis demonstrates that irregularity is not uniformly detrimental to neural morphological learning\. Instead, its impact depends on structural complexity, distributional frequency, and interaction with the model’s inductive biases\.

A specific low\-frequency irregular subtype emerges as a structurally distinct case that disproportionately contributes to overall error mass\. Despite representing less than 1% of the training data, it accounts for a substantial share of systematic failures\. This concentration suggests that instability arises from specific morphophonological configurations rather than irregularity in general\.

Crucially, removing the entire irregular set \(Type 4\) does not maximize performance\. Retaining other irregular subtypes \(4\-1 and 4\-3\) produces lower error rates than a purely regular training regime\. This suggests a non\-monotonic relationship between structural variability and generalization\. However, extremely low\-frequency, structurally idiosyncratic patterns, such as Type 4\-2, are associated with reduced generalization stability\.

From a distributional perspective, Type 4\-2 functions as a rare but highly influential morphological pattern\. Although it represents less than 1% of the training data, it contributes disproportionately to model errors\. Aggregate accuracy \(approximately 98%\) obscures this effect, which becomes visible only under subtype\-level evaluation\.
This pattern highlights a general limitation of aggregate evaluation in morphological modeling: overall performance can mask concentrated weaknesses in rare but structurally complex subclasses\. Only fine\-grained analysis reveals these effects\.

Methodologically, these results suggest that evaluation in morphological generation should incorporate subtype\-level reporting and explicit measures of error concentration\. Aggregate metrics alone may conceal linguistically meaningful weaknesses, particularly in morphologically rich languages where structural subclasses vary significantly in frequency and complexity\.

## 8\. Conclusion

We presented a subgroup\-aware analysis of Japanese past\-tense inflection, examining how minority structural subclasses influence neural generalization\. Through controlled ablation experiments, we showed that:

- Type 4\-2 irregular verbs constitute a low\-frequency morphological subclass with disproportionate error concentration\.
- Removing only this subtype improves performance more than removing all irregular verbs\.
- Moderate irregularity may support generalization, while extremely low\-frequency irregular patterns can destabilize learning\.

These findings demonstrate that high aggregate accuracy can mask structural effects within morphological systems\. Evaluation should therefore go beyond overall performance and include analysis of rare morphological subclasses\.

More broadly, these results underscore the importance of distribution\-sensitive evaluation in neural NLP\. Robust language modeling requires not only high average performance but stable generalization across rare and structurally complex subclasses\. Incorporating fine\-grained diagnostic analysis into morphological benchmarking can improve both transparency and linguistic validity of evaluation\.

## 9\. Limitations

Several limitations should be acknowledged\. First, our study focuses on a single language and a single morphological task \(past\-tense inflection\)\.
Although Japanese provides a controlled environment for examining structural effects in morphological learning, cross\-linguistic validation is necessary to determine generality\. Second, we evaluate two transformer\-based architectures derived from shared tasks\. While these represent strong baselines, alternative architectures, such as pretrained character\-level language models or multilingual systems, may exhibit different sensitivity to low\-frequency morphological patterns\.

## 10\. Future Work

Several
