MorfFlex: Handling Rich Morphology
Summary
This paper presents MorfFlex, a morphological dictionary architecture for languages with rich inflection and derivation, exemplified by MorfFlex CZ for Czech, which contains over 100 million wordforms and supports annotation consistency and NLP tools.
# MorfFlex: Handling Rich Morphology
Source: [https://arxiv.org/html/2606.24366](https://arxiv.org/html/2606.24366)
###### Abstract
We present MorfFlex, a morphological dictionary architecture suitable for languages with extensive regularity in both inflection and derivation\. As the primary example of MorfFlex in use we introduce MorfFlex CZ, a morphological dictionary of Czech\. It is distributed as a simple, unstructured list of <wordform, lemma, tag\> triplets; however, its manually maintained, unpublished source files and conversion scripts encode a sophisticated system of inflectional and derivational patterns\. These patterns dramatically reduce the otherwise enormous size of the dictionary, which currently contains over 100 million wordforms and more than 1 million lemmas\. The MorfFlex CZ dictionary serves as an essential resource for ensuring the consistency of manual morphological annotation in the Prague Dependency Treebanks and underpins state\-of\-the\-art automatic tools such as MorphoDiTa\. In this paper, we focus on: \(i\) presenting an effective method for managing the rich morphological system within the dictionary, and \(ii\) demonstrating the utility of such a language resource for maintaining annotation consistency in corpora and supporting the development of advanced NLP applications\.
Keywords: morphological dictionary, inflection, derivation, corpus
Jaroslava Hlaváčová, Marie Mikulová, Barbora Štěpánková, Milan Straka, Jan Hajič

Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

Malostranské náměstí 25, 118 00 Prague 1, Czech Republic

\{hlavacova,mikulova,stepankova,straka,hajic\}@ufal\.mff\.cuni\.cz
## 1\. Introduction
Over the course of many years of annotation work, numerous valuable linguistic resources \(corpora and dictionaries\) have been developed, forming the foundation for a wide range of successful applications and tools in the field of NLP\. However, today the future of linguistic annotation lies at the intersection of traditional computational linguistics and modern approaches to natural language processing and artificial intelligence\. The role of manual linguistic annotation in application development is steadily diminishing, and annotated resources are required \(if at all\) only in exceptional cases\. This naturally raises the question of whether manual annotation still has a real purpose in the era of large language models\. Drawing on several decades of experience with the development of linguistically annotated resources for both theoretical and applied research, we remain firmly convinced that a formalized, reusable source of linguistic knowledge is still a highly desirable goal of computational linguistics\.
Table 1: MorfFlex CZ: a sample of <wordform, lemma, tag\> triplets for the paradigm létat ‘to\-fly’\. The complete paradigm contains 106 triplets and 94 unique wordforms \(some wordforms are homonymous, e\.g\. the wordform létaly in this sample\)\.

In this contribution, we present [MorfFlex CZ](https://ufal.mff.cuni.cz/morfflex), a morphological dictionary of Czech, which is a highly inflectional language\. The dictionary was originally developed by Jan Hajič in the late 1980s \(Hajič, [2004](https://arxiv.org/html/2606.24366#bib.bib9)\) as a basis for a spelling checker\. Later, it was adapted for commercial applications, particularly lemmatization to support more efficient text search \(Hajič and Drozd, [1990](https://arxiv.org/html/2606.24366#bib.bib15)\)\. The dictionary then served as a key resource for manual morphological annotation of 1\.7 million tokens in the Prague Dependency Treebank, a pioneering initiative and one of the very first treebanks and tools ever created\. Even before the official release in 2001 \(Hajič et al\., [2001](https://arxiv.org/html/2606.24366#bib.bib11)\), an earlier version, annotated at both the morphological and syntactic layers and comprising approximately 400,000 words, was used in 1998 at the Johns Hopkins University workshop\. The corpus data provided training material for the first statistical taggers \(Hajič et al\., [1998](https://arxiv.org/html/2606.24366#bib.bib14)\) and syntactic parsers \(Collins et al\., [1999](https://arxiv.org/html/2606.24366#bib.bib6); Charniak, [2000](https://arxiv.org/html/2606.24366#bib.bib4)\)\.
To date, the dictionary continues to be developed and manually maintained mainly as an essential resource to ensure the consistency of manual morphological annotation in the Prague Dependency Treebanks\. The latest version is [MorfFlex CZ 2\.1](http://hdl.handle.net/11234/1-5833) \(Hajič et al\., [2024b](https://arxiv.org/html/2606.24366#biba.bib2)\), fully compatible with the Prague Dependency Treebank – Consolidated 2\.0 annotation \(Hajič et al\., [2024a](https://arxiv.org/html/2606.24366#biba.bib1); Mikulová et al\., [2026](https://arxiv.org/html/2606.24366#bib.bib20)\)\. It also serves as the basis for state\-of\-the\-art automatic tools such as MorphoDiTa \(Straková et al\., [2014](https://arxiv.org/html/2606.24366#bib.bib27); see Sect\. [6](https://arxiv.org/html/2606.24366#S6)\)\.
Table 2: Number of unique lemmas and <wordform, lemma, tag\> triplets in MorfFlex CZ 2\.1\.

The architecture of MorfFlex CZ is specifically designed for inflectional languages with a large number of endings \(suffixes\)\. We refer to this architecture, which was later applied to other languages with rich inflection \(see Sect\. [6](https://arxiv.org/html/2606.24366#S6)\), simply as MorfFlex\.
In inflectional languages, words take endings to mark case, number, gender, tense, etc\. Therefore, many wordforms may be related to one lemma\. E\.g\., the Czech word létat \(‘to\-fly’\) can appear as létám ‘I\-am\-flying’, nelétám ‘I\-am\-not\-flying’ \(the prefix ne\- ‘not/ir\-/un\-’ regularly forms the negative wordforms\), létáš ‘you\-are\-flying’, létáme ‘we\-are\-flying’, létejme ‘let’s\-fly’, létalo ‘it flew’, létali ‘they flew’, etc\.; there are several tens of wordforms for this type of verb\. Corpus\-wise \(in the Prague Dependency Treebank – Consolidated 2\.0 release\), there are 230K unique wordforms and 97K lemmas in a corpus of almost 4M words \(cf\. Table [3](https://arxiv.org/html/2606.24366#S1.T3)\)\. It is therefore crucial to handle endings effectively and to reduce processing costs wherever regularities are found\. The dictionary is distributed as a plain unstructured list of <wordform, lemma, tag\> triplets \(as shown in Table [1](https://arxiv.org/html/2606.24366#S1.T1)\)\. However, the manually maintained, unpublished source files and conversion scripts in fact represent a sophisticated system of inflectional and derivational patterns and rules, which reduce the otherwise enormous size of the dictionary\. Today, the dictionary contains more than 100 million wordforms and more than 1 million lemmas \(cf\. Table [2](https://arxiv.org/html/2606.24366#S1.T2)\)\. We release the tool for expanding the source format into the basic <wordform, lemma, tag\> triplets under an open\-source license at [https://github\.com/ufal/morfflex\-generator](https://github.com/ufal/morfflex-generator)\.
In the paper, we describe the Czech morphological dictionary MorfFlex CZ and we focus on presenting:
- an effective method for handling the rich morphological system of inflectional languages in the dictionary,
- the usefulness of such a language resource for ensuring the consistency of corpus morphological annotation and for developing NLP applications\.
The paper is organized as follows\. In Sect\. [2](https://arxiv.org/html/2606.24366#S2), the related context of computational morphology is briefly outlined\. Sect\. [3](https://arxiv.org/html/2606.24366#S3) provides an introduction to the system of handling rich morphology in MorfFlex\. In Sect\. [4](https://arxiv.org/html/2606.24366#S4), the formats of the system are described in more detail: the unpublished source format \(Sect\. [4\.1](https://arxiv.org/html/2606.24366#S4.SS1)\), the intermediate format \(Sect\. [4\.2](https://arxiv.org/html/2606.24366#S4.SS2)\), and the basic format \(Sect\. [4\.3](https://arxiv.org/html/2606.24366#S4.SS3)\) in which the dictionary is distributed\. Sect\. [5](https://arxiv.org/html/2606.24366#S5) describes the procedure for converting from the source format to the intermediate format \(Sect\. [5\.1](https://arxiv.org/html/2606.24366#S5.SS1)\) and then to the basic format \(Sect\. [5\.2](https://arxiv.org/html/2606.24366#S5.SS2)\)\. The direct path is described in Sect\. [5\.3](https://arxiv.org/html/2606.24366#S5.SS3)\. The use of the dictionary is described in Sect\. [6](https://arxiv.org/html/2606.24366#S6)\. We conclude in Sect\. [7](https://arxiv.org/html/2606.24366#S7)\.
Table 3: Number of unique wordforms and lemmas in the Prague Dependency Treebank – Consolidated 2\.0 release\.
## 2\. Related Work
In the context of NLP, morphology is primarily employed in the development of tools that recognize relationships between wordforms \(morphological analyzers, e\.g\. [Voikko](https://voikko.puimula.org/) or Hunmorph; Trón et al\., [2005](https://arxiv.org/html/2606.24366#bib.bib29)\) or between related words \(derivative tools, e\.g\. Sánchez Gutiérrez et al\., [2017](https://arxiv.org/html/2606.24366#bib.bib28)\)\. Many of these tools were originally created for spellchecking purposes \(cf\. the English [SCOWL](http://wordlist.aspell.net/dicts/), Spell Checker Oriented Word Lists\)\. For languages with a rich inflectional structure, for which a comprehensive list would be impractical, a combination of lemmas/roots and inflections is used \(cf\. [Collatinus](https://outils.biblissima.fr/en/collatinus-web/) for Latin, [Ajka](https://nlp.fi.muni.cz/projects/ajka/) for Czech, or [Polimorf](https://zil.ipipan.waw.pl/PoliMorf) for Polish; for a detailed description see e\.g\. Crane, [1991](https://arxiv.org/html/2606.24366#bib.bib7), Osolsobě, [1996](https://arxiv.org/html/2606.24366#bib.bib22), Paikens et al\., [2024](https://arxiv.org/html/2606.24366#bib.bib23)\)\. Morphological dictionaries are often created simultaneously with the compilation of a corpus, using manual or semi\-manual annotation; for Slavic languages see e\.g\. Dobrovoljc et al\., [2018](https://arxiv.org/html/2606.24366#bib.bib8) and Ljubešić, [2019](https://arxiv.org/html/2606.24366#bib.bib18)\. Efforts are also being made to unify morphological features and create a consistent approach to morphology \(e\.g\. [Unimorph](https://unimorph.github.io/), Batsuren et al\., [2022](https://arxiv.org/html/2606.24366#bib.bib2); [Paralex](https://www.paralex-standard.org/), Beniamine et al\., [2023](https://arxiv.org/html/2606.24366#bib.bib3)\)\.
## 3\. Morphological Dictionary MorfFlex
The MorfFlex dictionary is a list of <wordform, lemma, tag\> triplets\. The lemma is the basic form of the wordform \(usually the form that is used as an entry in general dictionaries\)\. Though a lemma is often considered an abstract object, in MorfFlex it is always expressed as a human\-readable word\. The tag codes morphological properties of the wordform\. The set of all wordforms with the same lemma is called a paradigm\. The lemma is usually viewed as a representative of the whole paradigm\. For example, the English wordform fly belongs to the paradigm represented by the lemma fly\. The whole paradigm associated with the lemma fly is the set \{fly, flies, flew, flying, flown\}\. Czech paradigms are usually much larger because of the rich inflection of Czech\. The paradigm of the Czech equivalent of the English fly, namely the verb létat, has 94 unique wordforms in the MorfFlex CZ dictionary \(including 30 rare archaic wordforms; cf\. Table [1](https://arxiv.org/html/2606.24366#S1.T1)\)\.
The set of triplets must comply with the so\-called Golden rule of morphology \(Hlaváčová, [2017](https://arxiv.org/html/2606.24366#bib.bib16); Mikulová et al\., [2020](https://arxiv.org/html/2606.24366#bib.bib21)\), which says that any particular pair <lemma, tag\> can appear in only one triplet throughout the entire dictionary\. In other words, a pair <lemma, tag\> cannot be used to describe more than one wordform\. This guarantees that generating wordforms from <lemma, tag\> pairs is unambiguous\. We apply tag numbering to distinguish between different types of wordform variants \(cf\. the standard wordform nelétají ‘they\-are\-not\-flying’ and the non\-standard wordform nelétaj ‘they\-are\-not\-flying’ in Table [1](https://arxiv.org/html/2606.24366#S1.T1)\)\.
The triplets of the morphological dictionary can be used for generating as well as analyzing wordforms\. On the basis of the pair <lemma, tag\>, a single wordform is generated, as already explained\. The opposite task, analysis, assigns a set of pairs <lemma, tag\> to a given wordform\. In the latter case, there can be \(and often are\) several such pairs, as the Czech language is massively homonymous\. E\.g\., the morphological analysis assigns two pairs <lemma, tag\> to the wordform létaly \(cf\. Table [1](https://arxiv.org/html/2606.24366#S1.T1)\)\.
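The two directions described above can be sketched in a few lines of Python\. This is a minimal illustration, not the MorfFlex tooling: the function name `build_indexes` and all tag strings are hypothetical placeholders, and the sample merely mirrors the homonymy of létaly and the two variants nelétají and nelétaj mentioned in the text\.

```python
from collections import defaultdict


def build_indexes(triplets):
    """Build generation and analysis lookups from <wordform, lemma, tag> triplets."""
    generate = {}               # (lemma, tag) -> the single wordform
    analyze = defaultdict(set)  # wordform -> set of (lemma, tag) pairs
    for wordform, lemma, tag in triplets:
        key = (lemma, tag)
        # Golden rule: a <lemma, tag> pair may appear in only one triplet.
        if key in generate:
            raise ValueError(f"Golden rule violated for {key}: "
                             f"{generate[key]!r} vs {wordform!r}")
        generate[key] = wordform
        analyze[wordform].add(key)
    return generate, dict(analyze)


# Placeholder tags (hypothetical); real entries use positional tags.
sample = [
    ("létaly", "létat", "tag-a"),
    ("létaly", "létat", "tag-b"),      # homonymous wordform: second reading
    ("nelétají", "létat", "tag-c-1"),  # standard variant (hypothetical numbering)
    ("nelétaj", "létat", "tag-c-2"),   # non-standard variant (hypothetical numbering)
]
generate, analyze = build_indexes(sample)
```

Generation is then a single dictionary lookup on a <lemma, tag\> pair, while analysis of a wordform returns a set that may contain several pairs\.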
We show that for Czech and other highly inflectional languages with a high degree of regularity \(in inflection and/or in derivation\), paradigms can be described using patterns, i\.e\. sets of rules for the automatic generation of entire paradigms \(in the form of triplets\)\. The use of patterns substantially reduces the size of the dictionary\. It also makes the dictionary much more comprehensible for human maintainers\.
For various applications, the most convenient format of the morphological dictionary is the format of <wordform, lemma, tag\> triplets\. We call it the basic format\. On the other hand, the most convenient format for storing and maintaining the dictionary is the format with patterns, the so\-called source format\. Generating the basic format from the source format is not straightforward\. In most cases, every record of the source format is first transformed into a set of records of a virtual intermediate format, and subsequently into the basic format\. This transformation is carried out using rules that are triggered by patterns in both the source format and the intermediate format\. The whole procedure is presented as a schema in Fig\. [1](https://arxiv.org/html/2606.24366#S3.F1)\. The patterns in the source format are derivational \(an exception is formed by the so\-called trivial patterns, see Sect\. [5\.3](https://arxiv.org/html/2606.24366#S5.SS3)\)\. They are used for:
1. creating derivations \(new words\), namely their lemmas
2. assigning inflectional patterns to every derived lemma

The patterns in the intermediate format are inflectional\. They are used for:
1. generating the wordforms associated with a lemma
2. assigning the tag describing the morphological properties of every wordform
It is thus not necessary to include some \(actually, many\) words in the dictionary; they are derived using regular patterns for large sets of words\. For instance, there is one record in the source format for the verb létat ‘to fly’ \(cf\. the first row in Table [4](https://arxiv.org/html/2606.24366#S4.T4)\)\. Its derivational pattern \(ATN\) is “translated” into 20 records of the intermediate format \(cf\. sample in Table [5](https://arxiv.org/html/2606.24366#S4.T5)\)\. Here, an inflectional pattern is assigned to each of the newly derived lemmas\. This procedure makes it possible to create not only the whole paradigm of the originating verb létat ‘to fly’ itself, but also the paradigms of derived words such as létávat \(‘to use to fly’\), létání \(‘flying’ – noun\), létající \(‘flying’ – adjective\), etc\. Thus, from the single line in the source format of the verb létat ‘to fly’, 3,096 triplets of the basic format are automatically generated \(cf\. sample in Table [1](https://arxiv.org/html/2606.24366#S1.T1)\)\.
Figure 1: MorfFlex scheme\. There are three formats: source, intermediate and basic\. In the source format, two types of records are used\. The upper type contains a derivational pattern\. According to specific rules for derivational patterns, this record is transformed into one or more records of the intermediate virtual format, schematically drawn in the middle column\. The derivational pattern is replaced by a set of inflectional patterns, the new roots are created and the lemma is changed as well\. Each of the records from the intermediate format generates a set of <wordform, lemma, tag\> triplets; they are shown in the rightmost part of the scheme, the final basic format\. The lower part of the scheme shows the simplified procedure for trivial patterns\. In that case, the record in the intermediate format would be the same as the source one, which means skipping the intermediate format\. The italic font throughout the whole scheme represents the final items \(strings\), as they are being created while producing the basic format triplets\.
## 4\. MorfFlex Formats
In this section, all three MorfFlex formats are described: source \(Sect\.[4\.1](https://arxiv.org/html/2606.24366#S4.SS1)\), intermediate \(Sect\.[4\.2](https://arxiv.org/html/2606.24366#S4.SS2)\) and basic \(Sect\.[4\.3](https://arxiv.org/html/2606.24366#S4.SS3)\)\.
### 4\.1\. Source format
The source format of MorfFlex is stored as plain text in UTF\-8 encoding, which makes it very easy to maintain in any text editor\. Every source format record of the dictionary contains the following pieces of information \(the prefix d\- stands for “derivational”\):
- d\-root: beginning of a wordform that does not change within the d\-pattern\. It does not need to be a grammatical root of a word\. The d\-root always includes prefixes that appear at the beginning of the wordform, except for the regular negation prefix ne\- and the superlative prefix nej\-\.
- d\-pattern: derivational pattern \(new lemmas are derived according to the rules stored in this pattern\)\. If it equals 0 or 0n, we call it a trivial pattern; in that case, at least one TAGs item must be present in the dictionary source line\. In this paper, the d\-patterns are written in capital letters, in contrast to the inflectional patterns\.
- d\-lemma: originating lemma\. This item can be enriched with several types of information concerning variants, style, or just an explanation for maintainers\. This information has no influence on the procedure of generating the wordforms and their tags\. In case of homonymy, the lemmas are numbered to represent different paradigms, e\.g\., jak\-1 is the noun \(‘yak’, the animal\), jak\-2 is the conjunction \(‘as’\), and jak\-3 is the adverb \(‘how’\)\. See Table [4](https://arxiv.org/html/2606.24366#S4.T4)\.
- TAGs: morphological tags assigned to the wordform\. This item is present only when the d\-pattern is trivial \(i\.e\., equals 0 or 0n\)\.

Many paradigms are described by more than one record in the source format\. In that case, the d\-lemma is identical in all records belonging to that paradigm\. Typical examples are verbs with irregular conjugation\. Irregular or atypical inflection of nouns is also described using several records, usually by adding a record with the trivial pattern \(cf\. the record of ocet ‘vinegar’ in Table [4](https://arxiv.org/html/2606.24366#S4.T4)\)\. Examples of the source format records are shown in Table [4](https://arxiv.org/html/2606.24366#S4.T4)\.

Table 4: Record examples in the source format\. The first row is an example record of the verb létat ‘to\-fly’ with a derivational pattern\. The following 3 rows show records of the homonymous lemma jak, one with a regular d\-pattern; the remaining rows show the lemma ocet ‘vinegar’ with a regular d\-pattern and two more trivial patterns 0 for the description of two irregular wordforms not covered by the d\-pattern\.
### 4\.2\. Intermediate format
The intermediate format is only virtual\. It represents a transitional stage in the overall process of generating the final triplets, from the source format to the basic one\. The intermediate format records have the same items \(except TAGs\) as the records in the source format, but their content is different: instead of derivational patterns, there are inflectional patterns\. Thus, a virtual intermediate record contains the following pieces of information \(the prefix i\- stands for “inflectional”\):
- i\-root: beginning of a wordform that does not change during the inflection within the i\-pattern \(used for the creation of the whole paradigm for the i\-lemma\)\.
- i\-pattern: inflectional pattern\.
- i\-lemma: inflectional lemma, i\.e\. the final lemma that appears in the basic format\. The newly derived i\-lemmas include a so\-called backlink to the originating lemma\.

Examples of intermediate format records are shown in Table [5](https://arxiv.org/html/2606.24366#S4.T5)\.

Table 5: Examples of several records in the intermediate format derived from the source format record of the d\-lemma létat ‘to fly’ \(cf\. the first row in Table [4](https://arxiv.org/html/2606.24366#S4.T4)\)\.
### 4\.3\. Basic format
The basic format of MorfFlex is a plain unstructured list of <wordform, lemma, tag\> triplets:
- wordform
- lemma: representative wordform\. The lemma is unique for the whole paradigm, i\.e\. a set of wordforms grouped according to inflectional behaviour\.
- tag: the positional tag codes the full inflectional information for the wordform\. An overview and explanation of the tag positions are provided in Table [6](https://arxiv.org/html/2606.24366#S4.T6)\.

Table 6: Attributes in positional tags

An example of the final basic format is shown in Table [1](https://arxiv.org/html/2606.24366#S1.T1)\. It displays the triplets generated in the basic format from the intermediate record shown in the first row of Table [5](https://arxiv.org/html/2606.24366#S4.T5)\.
## 5\. MorfFlex Procedure
The procedure for creating a dictionary from the source format is not straightforward; it is carried out in two steps\. First, the source format is converted into a virtual intermediate format \(Sect\. [5\.1](https://arxiv.org/html/2606.24366#S5.SS1)\), from which the basic format is subsequently generated \(Sect\. [5\.2](https://arxiv.org/html/2606.24366#S5.SS2)\)\. Source format records that contain a trivial pattern skip the intermediate format and are used to generate the triplets of the basic format directly \(Sect\. [5\.3](https://arxiv.org/html/2606.24366#S5.SS3)\)\.
### 5\.1\. From source to intermediate format
In this section, we describe the procedure of the transformation from the source format \(described in Sect\.[4\.1](https://arxiv.org/html/2606.24366#S4.SS1)\) into the intermediate format \(Sect\.[4\.2](https://arxiv.org/html/2606.24366#S4.SS2)\)\.
Every source format record \(except those with trivial patterns\) is transformed into one or more intermediate records according to a set of special derivational rules\. This means that the d\-root in the source format can be different from the i\-root\(s\) of the intermediate format, and the d\-lemma is changed into one or more possibly different i\-lemmas\. For each new i\-root and i\-lemma, an inflectional i\-pattern is set\.
There are one or more derivational rules for each d\-pattern\. Every derivational rule is represented by a single line that codes the actions to be done with a record of the source format to create the record\(s\) in the intermediate format\. This is a slightly simplified format of a derivational rule:
d\-pattern r1,r2,r3,r4,r5,r6
The explanation is as follows:
- d\-pattern is a derivational pattern from the source format of the dictionary\.
- r1 is a string that is to be concatenated with the d\-root\. The result is a new root \(i\-root\)\. 0 \(zero\) means an empty string\.
- r2 is an inflectional pattern that is to be used to generate the paradigm of the new lemma\.
- r3 codes the first step of the construction of the new derived lemma \(i\-lemma\)\.
- r4 is the ending of the new derived lemma\.
- r5 and r6 relate to a backlink to the originating d\-lemma from which the new i\-lemma was created, i\.e\. a derivational link\. The derivational link then becomes part of the new lemma, in a form that serves as a note hinting at which derivation took place\.
The entire procedure of using derivational rules is demonstrated in detail on an example in Sect\.[5\.1\.1](https://arxiv.org/html/2606.24366#S5.SS1.SSS1)\.
#### 5\.1\.1\. Example of using a derivational rule
To demonstrate the derivational rules, we chose the example dárce ‘donor’ with the derivational pattern SC1\. The source format record is shown in Table [7](https://arxiv.org/html/2606.24366#S5.T7)\.

| d\-root | d\-pattern | d\-lemma | TAGs |
| --- | --- | --- | --- |
| dár | SC1 | dárce | |

Table 7: Source format record for d\-lemma dárce ‘donor’\.

The derivational rules for the \(derivational\) d\-pattern SC1 are in Table [8](https://arxiv.org/html/2606.24366#S5.T8)\. Each of the four lines in Table [8](https://arxiv.org/html/2606.24366#S5.T8) is used to derive one record of the intermediate format as presented in Table [9](https://arxiv.org/html/2606.24366#S5.T9)\. Table [9](https://arxiv.org/html/2606.24366#S5.T9) summarizes all the derived lemmas and their inflectional patterns that will generate the paradigm triplets in the basic format\. The rows in Table [9](https://arxiv.org/html/2606.24366#S5.T9) are numbered following the order of the derivational rules in Table [8](https://arxiv.org/html/2606.24366#S5.T8)\.

Table 8: Derivational rules for the SC1 d\-pattern\.

Table 9: Records of the intermediate format derived from the source format record presented in Table [7](https://arxiv.org/html/2606.24366#S5.T7) according to the derivational rules presented in Table [8](https://arxiv.org/html/2606.24366#S5.T8)\. The meanings of the i\-lemmas are: 1 donor, 2 belonging to donor, 3 donor\-woman, 4 belonging to donor\-woman\.

Based on a single\-line record in the source format \(Table [7](https://arxiv.org/html/2606.24366#S5.T7)\), four different virtual intermediate records \(Table [9](https://arxiv.org/html/2606.24366#S5.T9)\) are created according to the derivational rules presented in Table [8](https://arxiv.org/html/2606.24366#S5.T8)\. The assigned inflectional i\-patterns are used in the next step \(cf\. Sect\. [5\.2\.1](https://arxiv.org/html/2606.24366#S5.SS2.SSS1)\) for the generation of 250 different triplets in the basic format of the dictionary\.
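The derivational step can be sketched as follows\. This is a simplified illustration under stated assumptions, not the published generator: the class and function names are hypothetical, r3 is assumed here to be the number of trailing characters cut from the d\-lemma \(the paper only says that it codes the first step of lemma construction\), the backlink fields r5 and r6 are left out, and the second rule together with its pattern name is invented for illustration rather than taken from Table 8\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    d_root: str
    d_pattern: str
    d_lemma: str


@dataclass(frozen=True)
class IntermediateRecord:
    i_root: str
    i_pattern: str
    i_lemma: str


@dataclass(frozen=True)
class DerivationalRule:
    r1: str  # string concatenated with the d-root; "0" means empty
    r2: str  # inflectional pattern assigned to the new lemma
    r3: int  # ASSUMPTION: number of trailing characters cut from the d-lemma
    r4: str  # ending of the new lemma; "0" treated as empty (assumption)
    # r5 and r6 (the backlink to the d-lemma) are omitted from this sketch.


def _text(field):
    """Interpret the zero marker as an empty string."""
    return "" if field == "0" else field


def derive(record, rules):
    """Create one intermediate record per derivational rule."""
    derived = []
    for rule in rules:
        stem = record.d_lemma[:len(record.d_lemma) - rule.r3]
        derived.append(IntermediateRecord(
            i_root=record.d_root + _text(rule.r1),
            i_pattern=rule.r2,
            i_lemma=stem + _text(rule.r4),
        ))
    return derived


# Hypothetical rule set; only the first result (dár, sc1, dárce) is given in the paper.
HYPOTHETICAL_RULES = {
    "SC1": [
        DerivationalRule("0", "sc1", 0, "0"),
        DerivationalRule("c", "hypothetical_possessive", 1, "ův"),
    ],
}

source = SourceRecord("dár", "SC1", "dárce")
records = derive(source, HYPOTHETICAL_RULES[source.d_pattern])
```

The first derived record reproduces the intermediate record for dárce with i\-root dár and i\-pattern sc1; the second shows how a rule can change both the root and the lemma\.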
### 5\.2\. From intermediate to basic format
The intermediate format of the dictionary \(Sect\.[4\.2](https://arxiv.org/html/2606.24366#S4.SS2)\) is transformed into the basic format \(Sect\.[4\.3](https://arxiv.org/html/2606.24366#S4.SS3)\) by means of inflectional i\-patterns that appear in records of the intermediate format\. The inflectional i\-patterns contain a prescription for generating the wordforms and tags of the basic format triplets, while the lemma is simply copied from the intermediate virtual record\.
Every i\-pattern consists of a set of pairs <ending, tag\>, associating the ending \(it is simply a string to be concatenated with the i\-root\) with a particular set of grammatical morphological categories represented by the tag\.
The entire procedure of using an inflectional rule is demonstrated on an example in the next Sect\.[5\.2\.1](https://arxiv.org/html/2606.24366#S5.SS2.SSS1)\.
#### 5\.2\.1\. Example of using an inflectional rule
To demonstrate the usage of inflectional rules, we chose the intermediate record for the lemma dárce ‘donor’ with the i\-pattern sc1\. The intermediate format record is shown in Table [11](https://arxiv.org/html/2606.24366#S5.T11) \(cf\. also the first line of Table [9](https://arxiv.org/html/2606.24366#S5.T9)\)\.

Table 10: Example of the inflectional sc1 i\-pattern, according to which, for example, the paradigm of the noun dárce ‘donor’ is generated\.

| i\-root | i\-pattern | i\-lemma |
| --- | --- | --- |
| dár | sc1 | dárce |

Table 11: Intermediate record for i\-lemma dárce ‘donor’\.

The inflectional rules associated with the i\-pattern sc1 are demonstrated in Table [10](https://arxiv.org/html/2606.24366#S5.T10)\. These rules combine the i\-root dár with the endings \(end\) and tags in Table [10](https://arxiv.org/html/2606.24366#S5.T10), and generate the resulting basic format triplets \(a sample is shown in Table [12](https://arxiv.org/html/2606.24366#S5.T12)\)\.

Table 12: A shortened sample of <wordform, lemma, tag\> triplets for the paradigm dárce ‘donor’\.
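The inflectional step amounts to concatenating the i\-root with each ending of the i\-pattern and copying the lemma\. A minimal sketch, assuming an i\-pattern is held as a list of <ending, tag\> pairs: the function name `expand` is hypothetical, the three endings are ordinary forms of dárce chosen for illustration, and the tag strings are placeholders rather than the positional tags of Table 10\.

```python
def expand(i_root, i_lemma, i_pattern):
    """Turn one intermediate record into <wordform, lemma, tag> triplets."""
    # Each wordform is the i-root plus an ending; the lemma is copied unchanged.
    return [(i_root + ending, i_lemma, tag) for ending, tag in i_pattern]


# Illustrative fragment only; the tags are placeholders (hypothetical).
SC1_FRAGMENT = [
    ("ce", "tag-sg-nom"),
    ("cem", "tag-sg-ins"),
    ("ců", "tag-pl-gen"),
]

triplets = expand("dár", "dárce", SC1_FRAGMENT)
for triplet in triplets:
    print(triplet)
```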
### 5\.3\. Trivial patterns
In the case of trivial patterns, we can interpret the whole procedure as skipping over the intermediate format and generating the basic format triplets directly from the source format \(see Fig\. [1](https://arxiv.org/html/2606.24366#S3.F1)\)\. There are only two trivial patterns, 0 and 0n\. The latter differs from the former in that it also generates a negative wordform, which is formed by placing the prefix ne\- ‘non/ir/un’ at the beginning of the wordform\. No other changes to the root are induced by trivial patterns\. Every source record with a trivial pattern must contain the item TAGs, which consists of a set of morphological tags\. For each tag, one triplet is generated directly in the basic format\. The wordform equals the d\-root and the lemma equals the d\-lemma\.
The trivial patterns are mostly used for fully or partially irregular paradigms, but also for uninflected forms\. For fully irregular paradigms, the whole paradigm has to be described as a set of records with trivial patterns, form by form\. Partially irregular paradigms are described by means of non\-trivial pattern\(s\) plus additional record\(s\) with a trivial pattern \(cf\. e\.g\. the source format records for the d\-lemma ocet ‘vinegar’ in Table [4](https://arxiv.org/html/2606.24366#S4.T4)\)\.
The procedure with trivial patterns is demonstrated on an example in the next Sect\.[5\.3\.1](https://arxiv.org/html/2606.24366#S5.SS3.SSS1)\.
#### 5\.3\.1\. Examples of using a trivial pattern
To demonstrate the direct procedure with trivial patterns from source format to basic format, we chose the words dnes ‘today’ and zřídka ‘rarely’\. The former has the trivial pattern 0, the latter 0n, which makes it possible to create both affirmative and negative wordforms\. See the source format records in Table [13](https://arxiv.org/html/2606.24366#S5.T13)\. The resulting triplets of these records are in Table [14](https://arxiv.org/html/2606.24366#S5.T14)\.

Table 13: Source format records of the adverbs dnes ‘today’ and zřídka ‘rarely’ with trivial patterns\. The placeholder @ in the second record tag will be replaced with A for the affirmative and N for the negative wordform, see the following Table [14](https://arxiv.org/html/2606.24366#S5.T14)\.

Table 14: The resulting <wordform, lemma, tag\> basic format triplets for the paradigms dnes ‘today’ and zřídka ‘rarely’ of the trivial source format records presented in Table [13](https://arxiv.org/html/2606.24366#S5.T13)\.
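The direct path for trivial patterns can be sketched as follows\. This is a minimal illustration, assuming that tags of a 0 record are used verbatim and that the @ placeholder occurs only in 0n records; the function name `expand_trivial` and the short tag strings are hypothetical stand\-ins for the positional tags of Table 13\.

```python
def expand_trivial(d_root, d_pattern, d_lemma, tags):
    """Generate basic-format triplets directly from a trivial source record."""
    if d_pattern not in ("0", "0n"):
        raise ValueError(f"not a trivial pattern: {d_pattern!r}")
    triplets = []
    for tag in tags:
        if d_pattern == "0":
            # Wordform equals the d-root, lemma equals the d-lemma.
            triplets.append((d_root, d_lemma, tag))
        else:
            # 0n: affirmative form, then the negative form with the prefix ne-.
            triplets.append((d_root, d_lemma, tag.replace("@", "A")))
            triplets.append(("ne" + d_root, d_lemma, tag.replace("@", "N")))
    return triplets


# Placeholder tags (hypothetical), not the real positional tags.
today = expand_trivial("dnes", "0", "dnes", ["adv"])
rarely = expand_trivial("zřídka", "0n", "zřídka", ["adv-@"])
```

With these inputs, the 0 record yields a single triplet per tag, while the 0n record yields an affirmative and a negative triplet per tag\.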
## 6\. Use of MorfFlex
Morphological analysis, part\-of\-speech tagging are important components of NLP applications\. They usually represent initial steps of language processing\. Despite recent advances in part\-of\-speech and morphological tagging, the old truth that more data always gives better resultsBanko and Brill \([2001](https://arxiv.org/html/2606.24366#bib.bib1)\); Church and Mercer \([1993](https://arxiv.org/html/2606.24366#bib.bib5)\)still holds\. At the same time, consistency in data annotation is a very important factor\. For morphological annotation, especially for morphologically rich languages with thousands of possible combinations of morphological values, consistency can only be achieved when a dictionary lists all plausible morphological interpretations of all wordforms \(cf\.Hajič,[2000](https://arxiv.org/html/2606.24366#bib.bib13)\)\.
The morphological dictionary MorfFlex CZ has from the very beginning been an essential part of the manual morphological annotation in the Prague Dependency Treebanks \(the latest release is Prague Dependency Treebank – Consolidated 2\.0 \(Hajič et al\.,[2024a](https://arxiv.org/html/2606.24366#biba.bib1); Mikulová et al\.,[2026](https://arxiv.org/html/2606.24366#bib.bib20)\), containing almost 4M tokens manually annotated for morphology\)\. For details, see the specification of the Czech morphological annotation \(Mikulová et al\.,[2020](https://arxiv.org/html/2606.24366#bib.bib21)\)\. As the volume of manually annotated data increases, the dictionary is naturally being expanded and enriched as well\. The aim of all annotation projects is to achieve full consistency not only between the data and the dictionary, but also within both the data and the dictionary themselves \(cf\. Hajič et al\.,[2020](https://arxiv.org/html/2606.24366#bib.bib10); Hlaváčová et al\.,[2019](https://arxiv.org/html/2606.24366#bib.bib17)\)\.
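Consistency between annotated data and the dictionary can be illustrated with a simple membership check\. The following Python sketch is only an assumption about how such a check might look; the function name `find_inconsistencies`, the toy data and the tag strings are hypothetical and do not come from the PDT\-C tooling\.

```python
from typing import Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (wordform, lemma, tag)


def find_inconsistencies(
    annotated: Iterable[Triplet], dictionary: Set[Triplet]
) -> List[Triplet]:
    """Return annotated triplets that the dictionary does not list."""
    return [t for t in annotated if t not in dictionary]


# Toy dictionary and toy annotation with hypothetical tags.
dictionary = {
    ("dnes", "dnes", "Db-------------"),
    ("zřídka", "zřídka", "Dg-------1A----"),
}
annotated = [
    ("dnes", "dnes", "Db-------------"),
    ("zřídka", "zřídka", "Db-------------"),  # tag not licensed above
]
for triplet in find_inconsistencies(annotated, dictionary):
    print("not in dictionary:", triplet)
```

Each reported triplet points either to an annotation error or to a gap in the dictionary, so resolving it improves one of the two resources\.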
The PDT\-C corpus and the MorfFlex CZ dictionary are used for building models for MorphoDiTa \(Morphological Dictionary and Tagger; [https://lindat\.mff\.cuni\.cz/services/morphodita](https://lindat.mff.cuni.cz/services/morphodita)\), an open\-source tool for morphological analysis, generation, tokenization, lemmatization and tagging of texts\. It performs morphological analysis and generation using the MorfFlex CZ dictionary\. The MorphoDiTa tool achieves state\-of\-the\-art results with a throughput of around 10\-200K words per second\. On average across the datasets in PDT\-C, it reaches F1\-scores of 96\.27% for tagging and 98\.31% for lemmatization \(Straka and Straková,[2024](https://arxiv.org/html/2606.24366#bib.bib26)\)\. MorphoDiTa is used for automatic POS and morphological annotation of all the Czech corpora available in the Kontext KWIC tool at the LINDAT/CLARIAH\-CZ research infrastructure \([https://lindat\.cz](https://lindat.cz/)\)\. MorphoDiTa is also one of the two components used for morphological disambiguation of the Czech National Corpus \([https://www\.korpus\.cz](https://www.korpus.cz/); Petkevič and Jelínek,[2025](https://arxiv.org/html/2606.24366#bib.bib24)\)\.
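Dictionary\-based analysis and generation can be pictured as two lookups over the <wordform, lemma, tag\> triplets\. The Python sketch below is a conceptual illustration only and does not reflect how MorphoDiTa stores or queries the dictionary; the function name `build_indexes` and the tag strings are hypothetical\.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple

Triplet = Tuple[str, str, str]  # (wordform, lemma, tag)
Index = DefaultDict[str, Set[Tuple[str, str]]]


def build_indexes(triplets: Iterable[Triplet]) -> Tuple[Index, Index]:
    """Build lookup tables for analysis and generation."""
    analyses: Index = defaultdict(set)  # wordform -> {(lemma, tag)}
    forms: Index = defaultdict(set)     # lemma -> {(wordform, tag)}
    for wordform, lemma, tag in triplets:
        analyses[wordform].add((lemma, tag))
        forms[lemma].add((wordform, tag))
    return analyses, forms


# Toy triplets with hypothetical tags.
triplets = [
    ("zřídka", "zřídka", "Dg-------1A----"),
    ("nezřídka", "zřídka", "Dg-------1N----"),
]
analyses, forms = build_indexes(triplets)
print(sorted(analyses["nezřídka"]))  # analysis: wordform -> lemma and tag
print(sorted(forms["zřídka"]))       # generation: lemma -> wordforms
```

Analysis maps a wordform to all of its possible interpretations, while generation maps a lemma to all of its wordforms; a tagger then has to choose among the analyses in context\.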
The Czech UDPipe model is trained on the manual annotation of the PDT\-C data\. UDPipe \([https://lindat\.mff\.cuni\.cz/services/udpipe](https://lindat.mff.cuni.cz/services/udpipe)\) is a pipeline for tokenization, tagging, lemmatization and dependency parsing\. UDPipe took part in several competitions, reaching excellent results in all of them \(Zeman et al\.,[2018](https://arxiv.org/html/2606.24366#bib.bib31); McCarthy et al\.,[2019](https://arxiv.org/html/2606.24366#bib.bib19); Sprugnoli et al\.,[2020](https://arxiv.org/html/2606.24366#bib.bib25)\)\. Straka and Straková \([2024](https://arxiv.org/html/2606.24366#bib.bib26)\) show that a model that combines the deep learning architecture of UDPipe with rescoring by the morphological dictionary MorfFlex \(the core of MorphoDiTa\) improves over both a deep learning system and a dictionary\-based system on their own\.
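One simple way to picture dictionary rescoring is to restrict the tags proposed by a neural tagger to those that the dictionary licenses for the given wordform\. The Python sketch below illustrates this general idea only; it is an assumption and not the method of the cited paper, and the function name `rescore`, the scores and the tag names are hypothetical\.

```python
from typing import Dict, Set


def rescore(
    wordform: str,
    tag_scores: Dict[str, float],
    licensed: Dict[str, Set[str]],
) -> str:
    """Pick the best-scoring tag, preferring dictionary-licensed tags."""
    allowed = licensed.get(wordform, set())
    candidates = {t: s for t, s in tag_scores.items() if t in allowed}
    if not candidates:
        # Unknown wordform or no overlap: keep the tagger's own choice.
        candidates = tag_scores
    return max(candidates, key=lambda t: candidates[t])


# Toy scores from a hypothetical tagger and a toy dictionary.
licensed = {"zřídka": {"ADV"}}
print(rescore("zřídka", {"NOUN": 0.6, "ADV": 0.4}, licensed))
print(rescore("neznámé", {"NOUN": 0.6, "ADV": 0.4}, licensed))
```

In this sketch the dictionary overrides the tagger only when it knows the wordform, so wordforms outside the dictionary still receive the tagger's prediction\.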
MorfFlex CZ is also aligned with the set of 1,040,126 lexemes contained in DeriNet \(Vidra et al\.,[2019](https://arxiv.org/html/2606.24366#bib.bib30)\), a lexical network that models word\-formation relations in the lexicon of Czech\.
The MorfFlex framework can also be used for other languages with rich morphology\. In addition to Czech, there is also a version for Slovak called MorfFlex SK \(Hajič and Hric,[2017](https://arxiv.org/html/2606.24366#bib.bib12)\)\. This dictionary is also used in MorphoDiTa for the Slovak language\.
## 7\. Conclusion
In conclusion, MorfFlex exemplifies how a systematically organized morphological dictionary can effectively manage the complexity of a highly inflective language such as Czech\. By combining manually maintained source files with derivational and inflectional patterns, the dictionary achieves a remarkable balance between comprehensiveness and efficiency: from 450K rows in the source format, it produces over 100 million wordforms and more than 1 million lemmas\. MorfFlex CZ ensures consistency in manual morphological annotation within the Prague Dependency Treebanks and also provides a robust foundation for state\-of\-the\-art NLP tools like MorphoDiTa\.
This contribution highlights the continued relevance of formalized linguistic resources in the era of advanced computational methods\. MorfFlex demonstrates that even in highly inflectional languages, careful linguistic modelling can substantially reduce the size of a dictionary while ensuring it remains usable for human annotators and automated systems\. Ultimately, this resource underscores the ongoing importance of combining linguistic expertise with computational techniques to advance corpus annotation, NLP development, and the broader study of morphologically rich languages\.
## 8\. Acknowledgements
The research and language resource work reported in the paper has been supported by the LINDAT/CLARIN and LINDAT/CLARIAH\-CZ projects funded by the Ministry of Education, Youth and Sports of the Czech Republic \(projects LM2015071, LM2018101, LM2023062\)\. The original annotation has been supported by multiple projects in the past, funded nationally by both the Ministry of Education, Youth and Sports of the Czech Republic and the Czech Science Foundation\.
## 9\. Limitations
The presented pattern\-based management of rich morphological systems is limited to inflectional languages \(primarily Slavic ones\)\. For languages of an agglutinative type, it would need to be adapted\. On the one hand, the pattern\-based system makes it possible to efficiently generate a huge number of lemmas and wordforms; on the other hand, even a minor change in one pattern can lead to numerous changes in the resulting dictionary, which can be difficult to monitor\.
## 10\. Bibliographical References
- Banko and Brill \(2001\)Michele Banko and Eric Brill\. 2001\.[Scaling to Very Very Large Corpora for Natural Language Disambiguation](https://doi.org/10.3115/1073012.1073017)\.In*Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics*, pages 26–33, Toulouse, France\. Association for Computational Linguistics\.
- Batsuren et al\. \(2022\)Khuyagbaatar Batsuren, Omer Goldman, Salam Khalifa, Nizar Habash, Witold Kieraś, Gábor Bella, Brian Leonard, Garrett Nicolai, Kyle Gorman, Yustinus Ghanggo Ate, Maria Ryskina, Sabrina Mielke, Elena Budianskaya, Charbel El\-Khaissi, Tiago Pimentel, Michael Gasser, William Abbott Lane, Mohit Raj, Matt Coler, Jaime Rafael Montoya Samame, Delio Siticonatzi Camaiteri, Esaú Zumaeta Rojas, Didier López Francis, Arturo Oncevay, Juan López Bautista, Gema Celeste Silva Villegas, Lucas Torroba Hennigen, Adam Ek, David Guriel, Peter Dirix, Jean\-Philippe Bernardy, Andrey Scherbakov, Aziyana Bayyr\-ool, Antonios Anastasopoulos, Roberto Zariquiey, Karina Sheifer, Sofya Ganieva, Hilaria Cruz, Ritván Karahóǧa, Stella Markantonatou, George Pavlidis, Matvey Plugaryov, Elena Klyachko, Ali Salehi, Candy Angulo, Jatayu Baxi, Andrew Krizhanovsky, Natalia Krizhanovskaya, Elizabeth Salesky, Clara Vania, Sardana Ivanova, Jennifer White, Rowan Hall Maudslay, Josef Valvoda, Ran Zmigrod, Paula Czarnowska, Irene Nikkarinen, Aelita Salchak, Brijesh Bhatt, Christopher Straughn, Zoey Liu, Jonathan North Washington, Yuval Pinter, Duygu Ataman, Marcin Wolinski, Totok Suhardijanto, Anna Yablonskaya, Niklas Stoehr, Hossep Dolatian, Zahroh Nuriah, Shyam Ratan, Francis M\. Tyers, Edoardo M\. Ponti, Grant Aiton, Aryaman Arora, Richard J\. Hatcher, Ritesh Kumar, Jeremiah Young, Daria Rodionova, Anastasia Yemelina, Taras Andrushko, Igor Marchenko, Polina Mashkovtseva, Alexandra Serova, Emily Prud’hommeaux, Maria Nepomniashchaya, Fausto Giunchiglia, Eleanor Chodroff, Mans Hulden, Miikka Silfverberg, Arya D\. McCarthy, David Yarowsky, Ryan Cotterell, Reut Tsarfaty, and Ekaterina Vylomova\. 2022\.[UniMorph 4\.0: Universal Morphology](https://aclanthology.org/2022.lrec-1.89/)\.In*Proceedings of the Thirteenth Language Resources and Evaluation Conference*, pages 840–855, Marseille, France\. European Language Resources Association\.
- Beniamine et al\. \(2023\)Sacha Beniamine, Cormac Anderson, Mae Carroll, Matías Guzmán Naranjo, Borja Herce, Matteo Pellegrini, Erich Round, Helen Sims\-Williams, and Tiago Tresoldi\. 2023\.[Paralex: a DeAR standard for rich lexicons of inflected forms](https://ismo2023.ovh/fichiers/abstracts/4_ISMO_2023_Paralex.pdf)\.Presentation at International Symposium of Morphology\.
- Charniak \(2000\)Eugene Charniak\. 2000\.[A Maximum\-Entropy\-Inspired Parser](https://aclanthology.org/A00-2018/)\.In*1st Meeting of the North American Chapter of the Association for Computational Linguistics*\.
- Church and Mercer \(1993\)Kenneth W\. Church and Robert L\. Mercer\. 1993\.[Introduction to the Special Issue on Computational Linguistics Using Large Corpora](https://aclanthology.org/J93-1001/)\.*Computational Linguistics*, 19\(1\):1–24\.
- Collins et al\. \(1999\)Michael Collins, Jan Hajič, Lance Ramshaw, and Christoph Tillmann\. 1999\.[A Statistical Parser for Czech](https://doi.org/10.3115/1034678.1034754)\.In*Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics*, pages 505–512, College Park, Maryland, USA\. Association for Computational Linguistics\.
- Crane \(1991\)Gregory Crane\. 1991\.[Generating and Parsing Classical Greek](https://doi.org/10.1093/llc/6.4.243)\.*Literary and Linguistic Computing*, 6\(4\):243–245\.
- Dobrovoljc et al\. \(2018\)Kaja Dobrovoljc, Simon Krek, and Tomaž Erjavec\. 2018\.[The Sloleks Morphological Lexicon and its Future Development](https://doi.org/10.4312/9789612379131)\.In Vojko Gorjanc, Polona Gantar, Iztok Kosem, and Simon Krek, editors,*Dictionary of Modern Slovene: Problems and Solutions*, pages 42–63\. Založba Univerze v Ljubljani, Ljubljana, SI\.
- Hajič \(2004\)Jan Hajič\. 2004\.*Disambiguation of Rich Inflection \(Computational Morphology of Czech\)*\.Karolinum, Prague, Czech Republic\.
- Hajič et al\. \(2020\)Jan Hajič, Eduard Bejček, Jaroslava Hlavacova, Marie Mikulová, Milan Straka, Jan Štěpánek, and Barbora Štěpánková\. 2020\.[Prague Dependency Treebank \- Consolidated 1\.0](https://aclanthology.org/2020.lrec-1.641/)\.In*Proceedings of the Twelfth Language Resources and Evaluation Conference*, pages 5208–5218, Marseille, France\. European Language Resources Association\.
- Hajič et al\. \(2001\)Jan Hajič, Barbora Vidová Hladká, Jarmila Panevová, Eva Hajičová, Petr Sgall, and Petr Pajas\. 2001\.[Prague Dependency Treebank 1\.0](https://catalog.ldc.upenn.edu/LDC2001T10)\.LDC2001T10, Linguistic Data Consortium, Philadelphia, PA, USA\.
- Hajič and Hric \(2017\)Jan Hajič and Jan Hric\. 2017\.[MorfFlex SK 170914](http://hdl.handle.net/11234/1-3277)\.LINDAT/CLARIAH\-CZ digital library at the Institute of Formal and Applied Linguistics \(ÚFAL\)\.
- Hajič \(2000\)Jan Hajič\. 2000\.[Morphological Tagging: Data vs\. Dictionaries](https://aclanthology.org/A00-2013/)\.In*1st Meeting of the North American Chapter of the Association for Computational Linguistics*\.
- Hajič et al\. \(1998\)Jan Hajič, Eric Brill, Michael Collins, Barbora Hladká, Dale Jones, Ching Kuo, Lance Ramshaw, Owen Schwartz, Christoph Tillmann, and Daniel Zeman\. 1998\.[Core Natural Language Processing Technology Applicable to Multiple Languages](https://ufal.mff.cuni.cz/pdt1/Corpora/PDT_1.0/Doc/ws98/index.html)\.Technical report, Center for Language and Speech Processing, Johns Hopkins University, Baltimore, Maryland\.
- Hajič and Drozd \(1990\)Jan Hajič and Janus Drozd\. 1990\.[Spelling\-checking for Highly Inflective Languages](https://aclanthology.org/C90-3072/)\.In*COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics*\.
- Hlaváčová \(2017\)Jaroslava Hlaváčová\. 2017\.Golden Rule of Morphology and Variants of Wordforms\.*Jazykovedný časopis / Journal of Linguistics*, 68\(2\):136–144\.
- Hlaváčová et al\. \(2019\)Jaroslava Hlaváčová, Marie Mikulová, Barbora Štěpánková, and Jan Hajič\. 2019\.Modifications of the Czech morphological dictionary for consistent corpus annotation\.*Jazykovedný časopis / Journal of Linguistics*, 70\(2\):380–389\.
- Ljubešić \(2019\)Nikola Ljubešić\. 2019\.[Inflectional lexicon hrLex 1\.3](http://hdl.handle.net/11356/1232)\.Slovenian language resource repository CLARIN\.SI\.
- McCarthy et al\. \(2019\)Arya D\. McCarthy, Ekaterina Vylomova, Shijie Wu, Chaitanya Malaviya, Lawrence Wolf\-Sonkin, Garrett Nicolai, Christo Kirov, Miikka Silfverberg, Sabrina J\. Mielke, Jeffrey Heinz, Ryan Cotterell, and Mans Hulden\. 2019\.[The SIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross\-Lingual Transfer for Inflection](https://doi.org/10.18653/v1/W19-4226)\.In*Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology*, pages 229–244, Florence, Italy\. Association for Computational Linguistics\.
- Mikulová et al\. \(2026\)Marie Mikulová, Jiří Mírovský, Milan Straka, Pavlína Synková, Barbora Štěpánková, Jan Štěpánek, and Jan Hajič\. 2026\.Prague Dependency Treebank \- Consolidated 2\.0: Enriching a Complex Annotation Scheme\.In*Proceedings of the Fifteenth Language Resources and Evaluation Conference*, Palma de Mallorca, Spain\. European Language Resources Association\.
- Mikulová et al\. \(2020\)Marie Mikulová, Jaroslava Hlaváčová, Jan Hajič, Jiří Hana, Hana Hanová, Barbora Hladká, Barbora Štěpánková, and Daniel Zeman\. 2020\.[Manual for Morphological Annotation, Revision for the Prague Dependency Treebank \- Consolidated 1\.0](https://ufal.mff.cuni.cz/pdt-c/publications/TR_PDT_C_morph_manual.pdf)\.Technical Report TR\-2020\-64, Institute of Formal and Applied Linguistics, Charles University, Prague, Czech Republic\.
- Osolsobě \(1996\)Klára Osolsobě\. 1996\.*Algoritmický popis české formální morfologie a strojový slovník češtiny*\.Disertační práce, Masarykova univerzita, Brno\.
- Paikens et al\. \(2024\)Peteris Paikens, Lauma Pretkalniņa, and Laura Rituma\. 2024\.[A Computational Model of Latvian Morphology](https://aclanthology.org/2024.lrec-main.20/)\.In*Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation*, pages 221–232, Torino, Italia\. ELRA and ICCL\.
- Petkevič and Jelínek \(2025\)Vladimír Petkevič and Tomáš Jelínek\. 2025\.[Automatická morfologická disambiguace korpusů řady syn: spolupráce lingvistické introspekce a strojového učení](https://doi.org/10.58756/n11082501)\.*Naše řeč*, 108\(1\):3–40\.
- Sprugnoli et al\. \(2020\)Rachele Sprugnoli, Marco Passarotti, Flavio Massimiliano Cecchini, and Matteo Pellegrini\. 2020\.[Overview of the EvaLatin 2020 Evaluation Campaign](https://aclanthology.org/2020.lt4hala-1.16/)\.In*Proceedings of 1st Workshop on Language Technologies for Historical and Ancient Languages*, pages 105–110, Marseille, France\. European Language Resources Association\.
- Straka and Straková \(2024\)Milan Straka and Jana Straková\. 2024\.Open\-Source Web Service with Morphological Dictionary–Supplemented Deep Learning for Morphosyntactic Analysis of Czech\.In*27th International Conference on Text, Speech and Dialogue*, pages 279–290, Cham, Switzerland\. Masaryk University, Springer\.
- Straková et al\. \(2014\)Jana Straková, Milan Straka, and Jan Hajič\. 2014\.[Open\-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition](http://www.aclweb.org/anthology/P/P14/P14-5003.pdf)\.In*Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations*, pages 13–18, Baltimore, Maryland\. Association for Computational Linguistics\.
- Sánchez Gutiérrez et al\. \(2017\)Claudia Sánchez Gutiérrez, Hugo Mailhot, Hélène Deacon, and Maximiliano Wilson\. 2017\.[MorphoLex: A derivational morphological database for 70,000 English words](https://doi.org/10.3758/s13428-017-0981-8)\.*Behavior Research Methods*, pages 1–13\.
- Trón et al\. \(2005\)Viktor Trón, Gyögy Gyepesi, Péter Halácsky, András Kornai, László Németh, and Dániel Varga\. 2005\.[Hunmorph: Open Source Word Analysis](https://aclanthology.org/W05-1106/)\.In*Proceedings of Workshop on Software*, pages 77–85, Ann Arbor, Michigan\. Association for Computational Linguistics\.
- Vidra et al\. \(2019\)Jonáš Vidra, Zdeněk Žabokrtský, Magda Ševčíková, and Lukáš Kyjánek\. 2019\.[DeriNet 2\.0: Towards an All\-in\-One Word\-Formation Resource](https://aclanthology.org/W19-8510/)\.In*Proceedings of the Second International Workshop on Resources and Tools for Derivational Morphology*, pages 81–89, Prague, Czechia\. Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics\.
- Zeman et al\. \(2018\)Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov\. 2018\.[CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies](https://doi.org/10.18653/v1/K18-2001)\.In*Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies*, pages 1–21, Brussels, Belgium\. Association for Computational Linguistics\.
## 11\. Language Resource References
- Hajič et al\. \(2024a\)Hajič, Jan and Bejček, Eduard and Bémová, Alevtina and Buráňová, Eva and Fučíková, Eva and Hajičová, Eva and Havelka, Jiří and Hlaváčová, Jaroslava and Homola, Petr and Ircing, Pavel and Kárník, Jiří and Kettnerová, Václava and Klyueva, Natalia and Kolářová, Veronika and Kučová, Lucie and Lopatková, Markéta and Mareček, David and Mikulová, Marie and Mírovský, Jiří and Nedoluzhko, Anna and Novák, Michal and Pajas, Petr and Panevová, Jarmila and Peterek, Nino and Poláková, Lucie and Popel, Martin and Popelka, Jan and Romportl, Jan and Rysová, Magdaléna and Semecký, Jiří and Sgall, Petr and Spoustová, Johanka and Straka, Milan and Straňák, Pavel and Synková, Pavlína and Ševčíková, Magda and Šindlerová, Jana and Štěpánek, Jan and Štěpánková, Barbora and Toman, Josef and Urešová, Zdeňka and Vidová Hladká, Barbora and Zeman, Daniel and Zikánová, Šárka and Žabokrtský, Zdeněk\. 2024a\.[*Prague Dependency Treebank \- Consolidated 2\.0 \(PDT\-C 2\.0\)*](http://hdl.handle.net/11234/1-5813)\.LINDAT/CLARIAH\-CZ digital library at the Institute of Formal and Applied Linguistics \(ÚFAL\), Charles University, Prague, Czech republic\.PID[http://hdl\.handle\.net/11234/1\-5813](http://hdl.handle.net/11234/1-5813)\.
- Hajič et al\. \(2024b\)Hajič, Jan and Hlaváčová, Jaroslava and Marie Mikulová and Milan Straka and Barbora Štěpánková\. 2024b\.[*MorfFlex CZ 2\.1*](http://hdl.handle.net/11234/1-5833)\.LINDAT/CLARIAH\-CZ digital library at the Institute of Formal and Applied Linguistics \(ÚFAL\), Charles University, Prague, Czech republic\.PID[http://hdl\.handle\.net/11234/1\-5833](http://hdl.handle.net/11234/1-5833)\.