Concordance Comparison as a Means of Assembling Local Grammars

arXiv cs.CL Papers

Summary

This paper presents a method for comparing concordances of local grammars to optimize Named Entity Recognition for person names in Portuguese, achieving improved F-measure scores on the HAREM dataset.

arXiv:2605.11862v1 Announce Type: new Abstract: Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.

# Concordance Comparison as a Means of Assembling Local Grammars
Source: [https://arxiv.org/html/2605.11862](https://arxiv.org/html/2605.11862)
Elias de Oliveira \(1\), Eric Laporte \(2\) \(https://orcid\.org/0000\-0002\-0984\-0781\)

\(1\) Universidade Federal do Espírito Santo \- UFES, Av\. Fernando Ferrari, 514, 29075\-910 Vitória, ES, Brazil\. Email: jupcampos@gmail\.com, elias@lcad\.inf\.ufes\.br

\(2\) Université Paris\-Est, LIGM, UPEM/CNRS/ENPC/ESIEE, Champs\-sur\-Marne, 77420, France\. Email: eric\.laporte@univ\-paris\-est\.fr

###### Abstract

Named Entity Recognition for person names is an important but non\-trivial task in information extraction\. This article uses a tool that compares the concordances obtained from two local grammars \(LG\) and highlights the differences\. We used the results as an aid to select the best of a set of LGs\. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results\. This approach was used in a case study on extraction of person names from texts written in Portuguese\. We applied the enhanced grammar to the Gold Collection of the Second HAREM\. The F\-Measure obtained was 76\.86, representing a gain of 6 points in relation to the state\-of\-the\-art for Portuguese\.

## 1 Introduction

Named Entity Recognition \(NER\) involves automatically identifying names of entities such as persons, places and organizations\. Person names are a fundamental source of information\. Many applications seek information on individuals and their relationships, e\.g\. in the context of social networks\. However, extracting this type of Named Entity \(NE\) is challenging: person names are an open word class, which includes many words and grows every day \[[7](https://arxiv.org/html/2605.11862#bib.bib21)\]\.

“A good portion of NER research is devoted to the study of English, due to its significance as a dominant language that is used internationally” \[[14](https://arxiv.org/html/2605.11862#bib.bib91), page 470\]\. A major impetus for the development of NER systems for Portuguese came with the HAREM \[[13](https://arxiv.org/html/2605.11862#bib.bib18),[8](https://arxiv.org/html/2605.11862#bib.bib2)\] events, joint evaluation campaigns organized by Linguateca \[[6](https://arxiv.org/html/2605.11862#bib.bib24)\]\. The annotated corpora used in the first and second HAREM, known as the Golden Collection \(GC\), serve as a reference for recent works on Portuguese NER\.

The main approaches used to develop NER systems involve \(i\) machine learning, whereby systems learn to identify and classify NEs from a training corpus, \(ii\) the linguistic approach, which involves manual description of rules in which NEs can appear, and \(iii\) a hybrid approach that combines both previous methods\.

“Local grammars \(LG\) are finite\-state grammars or finite\-state automata that represent sets of utterances of a natural language” \[[5](https://arxiv.org/html/2605.11862#bib.bib43), page 1\]\. They were introduced by Maurice Gross \[[4](https://arxiv.org/html/2605.11862#bib.bib42)\] and serve as a way to group phrases with common characteristics \(usually syntactic or semantic\)\. Describing rules in the form of LGs for the construction of Information Extraction \(IE\) systems requires human expertise and training in linguistics; little computational aid for this task is available\.

A method for constructing LGs around a keyword or semantic unit is presented by \[[5](https://arxiv.org/html/2605.11862#bib.bib43)\]\. LGs for extracting person names from Portuguese texts were presented in \[[2](https://arxiv.org/html/2605.11862#bib.bib14)\] and \[[10](https://arxiv.org/html/2605.11862#bib.bib58)\]\. In the Second HAREM \[[8](https://arxiv.org/html/2605.11862#bib.bib2)\], the Rembrandt system, which uses grammar rules and Wikipedia as sources of knowledge \[[3](https://arxiv.org/html/2605.11862#bib.bib20)\], ranked best for the ‘person’ category\. A comparison between four tools to recognize NEs in Portuguese texts \[[1](https://arxiv.org/html/2605.11862#bib.bib47)\] suggested that the rule\-based approach is the most effective for person names\. Recently, LGs have been successfully integrated in a hybrid approach to Portuguese NER \[[11](https://arxiv.org/html/2605.11862#bib.bib59)\]\.

This paper describes how to use the Unitex concordance comparison tool \[[15](https://arxiv.org/html/2605.11862#bib.bib22)\] as an aid to constructing an LG\. Our point of departure was a set of LGs to identify person names in Portuguese texts\. By comparing the concordances obtained from them, we found set\-theoretic relationships between these LGs\. Taking these relationships into account, we picked the best LGs and combined them in order to achieve better performance\.

This article is organized as follows\. Section 2 presents the methodology used in this work\. The results of the study are presented in Section 3, and Section 4 presents conclusions and avenues for future research\.

## 2 The Methodology

The input to our experiment was a repository of small LGs to recognize person names\. Some were obtained from the literature \(e\.g\. those presented in \[[2](https://arxiv.org/html/2605.11862#bib.bib14)\]\); we created the others\.

All of these LGs were created and processed with Unitex \[[15](https://arxiv.org/html/2605.11862#bib.bib22)\], an open\-source system initially developed at the University of Paris\-Est Marne\-la\-Vallée in France\. A local grammar is represented as a set of one or more graphs referred to as Local Grammar Graphs \(LGG\)\. Unitex allows for creating LGGs, preprocessing texts, applying dictionaries to texts, applying LGs to extract information, generating concordances and comparing concordances\.

The LGG shown in Fig\. [1](https://arxiv.org/html/2605.11862#S2.F1) recognizes honorific titles such as Sr\., Sra\. and Dr\. \(“Mr\.”, “Mrs\.”, “Dr\.”\) followed by words with the first letter capitalized, as identified by the code <PRE\> in Unitex dictionaries\. The <<\.\.\>\> after <PRE\> denotes a morphological filter applied to these capitalized words, requiring them to contain at least two characters\. This prevents, for example, the recognition of definite articles at the beginning of sentences\. Between the capitalized words, prepositions or abbreviations may occur; they are recognized by two graphs, Preposicao\.grf and Abreviacoes\.grf, which were created separately and are included as subgraphs\. Examples of phrases recognized by the graph \(occurrences\) include Sra\. Joana da Silva and Dr\. Antônio de Oliveira Salazar\. A list of occurrences, each accompanied by one line of context, is referred to as a concordance\.
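As a rough illustration of what this graph accepts, the following Python sketch approximates it with a regular expression\. It is not how Unitex works internally: the word lists standing in for Preposicao\.grf and Abreviacoes\.grf, and the character classes used for capitalized words, are assumptions made for the example\.

```python
import re

# Hypothetical stand-ins for the subgraphs Preposicao.grf and Abreviacoes.grf.
PREPOSITION = r"(?:de|da|do|das|dos|e)"
ABBREVIATION = r"(?:[A-Z]\.)"
# Capitalized word of at least two characters (rough analogue of <PRE> plus the <<..>> filter).
CAP_WORD = r"[A-ZÀ-Ý][a-zà-ÿ]+"
# Only the three titles quoted in the text; the real graph may list more.
TITLE = r"(?:Sr\.|Sra\.|Dr\.)"

G1_PATTERN = re.compile(
    rf"{TITLE}\s+{CAP_WORD}(?:\s+(?:{PREPOSITION}\s+|{ABBREVIATION}\s+)?{CAP_WORD})*"
)


def find_occurrences(text):
    """Return every phrase made of a title followed by capitalized words."""
    return [match.group(0) for match in G1_PATTERN.finditer(text)]


sample = "A Sra. Joana da Silva encontrou o Dr. Antônio de Oliveira Salazar ontem."
print(find_occurrences(sample))
```

A one\-letter word after the title, as in "Sr\. A", is rejected because the capitalized\-word pattern requires at least two characters\.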

![Refer to caption](https://arxiv.org/html/2605.11862v1/x1.png)

Figure 1: LGG $G_1$ \(ReconheceFormasDeTratamento\.grf\)

Unitex allows for attaching outputs to graph boxes\. Outputs are displayed in bold under boxes\. In Fig\. [1](https://arxiv.org/html/2605.11862#S2.F1), <NOME\> \(“name”\) and </NOME\> shown under the arrows represent such outputs\. Unitex inserts them into the concordance when a graph is applied in the “MERGE with input text” mode\. Thus, the identified names appear enclosed in these XML tags in the concordance file\.
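The effect of merging outputs with the input text can be pictured with a small Python sketch that inserts the two tags around spans that have already been identified\. The function name merge\_tags and the character offsets are hypothetical; Unitex performs this step itself when the graph is applied\.

```python
def merge_tags(text, spans, open_tag="<NOME>", close_tag="</NOME>"):
    """Wrap each (start, end) character span of text in the given tags.

    Spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])                      # untouched text before the span
        pieces.append(open_tag + text[start:end] + close_tag)  # tagged occurrence
        cursor = end
    pieces.append(text[cursor:])                               # remainder of the text
    return "".join(pieces)


sentence = "O Dr. Antônio de Oliveira Salazar chegou."
start = sentence.index("Antônio")
end = sentence.index(" chegou")
print(merge_tags(sentence, [(start, end)]))
# O Dr. <NOME>Antônio de Oliveira Salazar</NOME> chegou.
```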

The LGs of the repository are small but can be combined to compose a larger grammar to identify person names\.

We applied the LGs of the repository to the Golden Collection \(GC\) of the Second HAREM, producing a concordance file for each LG\. We used Portuguese and English dictionaries because several English names appear in GC texts\.

The GC of the Second HAREM \[[8](https://arxiv.org/html/2605.11862#bib.bib2)\] is a subset of 129 annotated texts\. These texts belong to different textual genres and are written in European or Brazilian Portuguese\. HAREM distinguishes ten categories of NEs: abstraction, event, thing, place, work, organization, person, time, value, and other\. Person names, the focus of this work, are classified as a subtype within the ‘person’ category and are represented by the code PERSON \(INDIVIDUAL\)\. In the GC of the Second HAREM, 1,609 NEs are annotated with this code\.

### 2\.1 Concordance comparison

We compared all the concordances pairwise \(every pair of files\) using the ConcorDiff concordance comparison tool provided by Unitex\. This tool can be applied to any pair of concordance files, provided they are in the Unitex format, which is publicly documented in the manual \[[9](https://arxiv.org/html/2605.11862#bib.bib19)\]\.
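Enumerating every pair of concordance files is simple to script\. The sketch below only lists the pairs to be compared; the file names are hypothetical, and the actual comparison would still be run with ConcorDiff\.

```python
from itertools import combinations


def concordance_pairs(files):
    """Return every unordered pair of concordance files."""
    return list(combinations(files, 2))


# Hypothetical file names, one concordance per LG.
files = ["concord_G1.html", "concord_G2.html", "concord_G3.html"]
for first, second in concordance_pairs(files):
    print(first, "vs", second)
```

For n concordance files this yields n\(n\-1\)/2 comparisons, so the number of comparison files to inspect grows quadratically with the size of the repository\.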

The Unitex ConcorDiff program compares two concordance files line by line and shows their differences\. The result is an HTML page that presents alternate lines of both concordances and that leaves an empty line when an occurrence appears in only one of them\. An example is presented in Fig\. [2](https://arxiv.org/html/2605.11862#S2.F2)\. The lines with a pink background shading \(lines 1, 3, 5 and 7\) are from the first concordance \(the first parameter to ConcorDiff\), and those with a green background shading \(lines 2, 4 and 6\) are from the other concordance \(the second parameter to ConcorDiff\)\.

![Refer to caption](https://arxiv.org/html/2605.11862v1/x2.png)

Figure 2: Part of a concordance comparison file

Lines in blue characters \(lines 1 and 2\) are the occurrences common to the two concordances\. In the example shown in Fig\. [2](https://arxiv.org/html/2605.11862#S2.F2), this means that both LGs recognized Michael Jackson\. Lines in red characters \(lines 3 and 4\) correspond to occurrences that overlap only partially, which is the case, for instance, when an occurrence in a concordance is part of an occurrence in the other\. In the example, an LG recognized Luther King, and the other recognized Luther\. Lines in green characters \(lines 5 and 7\) are the occurrences that appear in only one of the two concordances\. Antonio Ricardo and Chico Buarque were recognized only by the first LG\. Lines in purple characters indicate identical occurrences with different outputs inserted, which does not happen in this example\.
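The four character colors can be read as a small decision procedure over pairs of occurrences\. The Python sketch below mimics that reading; it is an illustration, not the ConcorDiff implementation, and the Occurrence class with its character offsets is an assumption made for the example\.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Occurrence:
    start: int        # offset of the first character in the corpus
    end: int          # offset just past the last character
    output: str = ""  # text inserted by the grammar, e.g. a tag


def classify(a: Optional[Occurrence], b: Optional[Occurrence]) -> str:
    """Return the character color for a (first concordance) and b (second)."""
    if a is None or b is None:
        return "green"                                   # present in only one concordance
    if (a.start, a.end) == (b.start, b.end):
        return "blue" if a.output == b.output else "purple"
    if a.start < b.end and b.start < a.end:
        return "red"                                     # partial overlap
    return "green"                                       # unrelated occurrences


# "Luther King" against "Luther": same start, different length.
print(classify(Occurrence(0, 11), Occurrence(0, 6)))
# red
```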

We then analyzed the files generated by ConcorDiff\.

### 2\.2 Composition of LG from concordance comparisons

Let $G_X$ and $G_Y$ be two LGs, and let $C_X$ and $C_Y$ be the respective concordance files obtained by applying them to the same corpus\. Thus, $C_X$ is the set of occurrences identified by $G_X$, and $C_Y$ is the set of occurrences identified by $G_Y$\. Let $C_X \times C_Y$ be the file that shows the differences between concordances $C_X$ and $C_Y$, obtained through the ConcorDiff program of Unitex\. In $C_X \times C_Y$, the elements $x_1, x_2, \ldots, x_n$ of $C_X$ are displayed on a pink background, while the elements $y_1, y_2, \ldots, y_m$ of $C_Y$ are displayed on a green background\. Set\-theoretic relationships such as inclusion, intersection or disjunction may hold between $C_X$ and $C_Y$, and they can be observed by analyzing $C_X \times C_Y$\.
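When the two concordances are available as plain sets, these relationships can also be checked mechanically\. The sketch below is a minimal illustration that represents each occurrence by its text alone, which is a simplification: a real concordance would also need the position of each occurrence in the corpus\.

```python
def relation(c_x, c_y):
    """Name the set-theoretic relationship between two sets of occurrences."""
    c_x, c_y = set(c_x), set(c_y)
    if c_x == c_y:
        return "equal"
    if c_x < c_y:
        return "C_X included in C_Y"
    if c_y < c_x:
        return "C_Y included in C_X"
    if c_x & c_y:
        return "intersection"
    return "disjoint"


print(relation({"Michael Jackson", "Chico Buarque"}, {"Michael Jackson"}))
# C_Y included in C_X
```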

![Refer to caption](https://arxiv.org/html/2605.11862v1/x3.png)

Figure 3: LG $G_2$ \(ReconheceNomesCompostos\.grf\)

Consider, for example, LGs $G_1$ \(Fig\. [1](https://arxiv.org/html/2605.11862#S2.F1)\) and $G_2$ \(Fig\. [3](https://arxiv.org/html/2605.11862#S2.F3)\)\. $G_2$ recognizes person names stored in dictionaries, through dictionary codes N\+PR for proper names and Hum for nouns referring to human beings\. Multiword person names such as Marilyn Monroe, Cameron Diaz and Albert Einstein are recognized by this LG after applying the English dictionary to the input text\.

![Refer to caption](https://arxiv.org/html/2605.11862v1/x4.png)

Figure 4: Part of the concordance comparison $C_1 \times C_2$

Figure [4](https://arxiv.org/html/2605.11862#S2.F4) shows part of the concordance comparison $C_1 \times C_2$\. The first line, $y_1$, includes the name Jimmy Carter recognized by $G_2$\. The first line displayed on a pink background, $x_1$, includes the name Afonso Henriques occurring after D\. and recognized by $G_1$\. Since lines in green characters are occurrences identified by only one of the two graphs, the first two occurrences were identified by $G_2$ only, and the last one by $G_1$ only\. If all the lines of the comparison are in green characters and distributed between the two background colors, $C_1$ and $C_2$ are disjoint sets: thus, both LGs $G_1$ and $G_2$ are worth retaining as subgraphs of a grammar because they recognize different names\.

Table [1](https://arxiv.org/html/2605.11862#S2.T1) summarizes the main set\-theoretic relationships identified\. Each situation has a consequence in terms of priority between LGs, for example: $G_X$ can be discarded if $G_Y$ is retained\. After analyzing relationships between all pairs of LGs, we selected a subset of LGs and combined them into a larger LG \(30 LGGs\) by invoking them in a main graph\.
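One way to picture such a priority rule is a pruning pass that drops any LG whose concordance is contained in that of an LG already retained\. The Python sketch below is only an illustration under that assumption: the inclusion criterion, the function name select\_grammars and the toy data are hypothetical, and the selection described here was made manually and may also weigh false positives\.

```python
def select_grammars(concordances):
    """Keep the grammars whose occurrence set is not contained in a retained one.

    concordances maps a grammar name to its set of occurrences.
    """
    kept = []
    # Visit larger concordances first so a subset always meets its superset.
    by_size = sorted(concordances, key=lambda name: len(concordances[name]), reverse=True)
    for name in by_size:
        occurrences = concordances[name]
        if not any(occurrences <= concordances[other] for other in kept):
            kept.append(name)
    return kept


toy = {
    "G_A": {"Michael Jackson", "Chico Buarque"},
    "G_B": {"Michael Jackson"},    # included in G_A, so it can be discarded
    "G_C": {"Afonso Henriques"},   # disjoint from the others, so it is kept
}
print(select_grammars(toy))
# ['G_A', 'G_C']
```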

Table 1: Main relationships observed through concordance comparison

- 1 $C_X \sim C_Y \Leftrightarrow (n = m \text{ and } \forall i \;\; x_i \text{ overlaps } y_i)$\.
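The relation defined in the footnote to Table 1 can be checked directly once occurrences are given as character spans\. The sketch below assumes that both concordances are lists of \(start, end\) offsets sorted in corpus order, so that the i\-th occurrence of one is compared with the i\-th occurrence of the other; this representation is an assumption made for the example\.

```python
def overlaps(x, y):
    """True when two (start, end) spans share at least one character."""
    return x[0] < y[1] and y[0] < x[1]


def similar(c_x, c_y):
    """C_X ~ C_Y: same number of occurrences and pairwise overlap."""
    return len(c_x) == len(c_y) and all(overlaps(x, y) for x, y in zip(c_x, c_y))


print(similar([(0, 5), (10, 20)], [(0, 3), (12, 20)]))
# True
```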

## 3 Results and Discussion

We could not compare the performance of the obtained LG with that of the initial set of small LGs, since these do not together constitute a single annotator\. Instead, we evaluated two annotators, one based on the obtained LG and another on an enhanced version of it, and compared the results with those of Rembrandt, a widely known reference\.

We applied the obtained LG to the HAREM corpus and generated an XML file with the identified NEs, annotated according to the directives of the Second HAREM\. Parts of the person names identified by the LG that appear isolated in the text are also annotated\.

This file was submitted to SAHARA \[[12](https://arxiv.org/html/2605.11862#bib.bib23)\] for performance evaluation\. SAHARA is an online system for automatic evaluation for HAREM, which computes the precision, recall and F\-measure of an NER system after the user configures the evaluation and submits XML\-annotated files\.

The results obtained by applying the LG to the GC of the Second HAREM were 59\.06% for precision, 55\.22% for recall and 57\.07 for F\-measure\.
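These three figures are mutually consistent if the F\-measure is the balanced harmonic mean of precision and recall, $F = 2PR / (P + R)$\. The sketch below assumes that definition; how SAHARA computes it internally is not detailed here\.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(59.06, 55.22), 2))
```

The value computed from the rounded precision and recall is about 57\.08, within one hundredth of the reported 57\.07; the small gap is presumably due to rounding of the inputs\.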

Then, we employed manual strategies to improve the performance of the LG\. In the Second HAREM, some words in lowercase letters should form part of the NE \(see http://www\.linguateca\.pt/aval\_conjunta/HAREM/minusculas\.html\)\. Examples are the honorific titles recognized by the LGG in Fig\. [1](https://arxiv.org/html/2605.11862#S2.F1) and nouns denoting a person’s social position that appear before the name\. In an example provided by HAREM \(http://www\.linguateca\.pt/aval\_conjunta/HAREM/ExemplarioSegundoHAREM\.pdf\), A rainha Isabel II surpreendeu a Inglaterra \(“Queen Elizabeth II surprised England”\), not only the name Isabel but the whole phrase rainha Isabel II \(“Queen Elizabeth II”\) should be labeled as a person name\.

We adapted the LGG ReconheceFormasDeTratamento\.grf to address this issue by simply moving the opening tag \(<NOME\>\) to before the honorific title in the graph, so that the title belongs to the tagged NE\. We also used these words in lowercase letters to recognize the ‘position’ subcategory of the ‘person’ category, represented by PERSON\(POSITION\), and to recognize person names with a noun of social position in the left context\.

The results obtained by the final LG are presented in Table [2](https://arxiv.org/html/2605.11862#S3.T2)\. They were obtained with SAHARA by selecting the custom setting PERSON\(INDIVIDUAL\)\. This table also shows measures computed by SAHARA for Rembrandt, the system with the best performance for the ‘person’ category of the Second HAREM\.

Table 2: Results considering PERSON\(INDIVIDUAL\): Rembrandt vs\. final LG

The LG outperforms Rembrandt\. The recall of the LG is approximately 10 percentage points above that of Rembrandt\.

Although our LG recognizes only the ‘individual’ and ‘position’ subtypes of the ‘person’ category, we also evaluated it with SAHARA on all subtypes of this category by selecting the PERSON\(\*\) setting\. A comparison of the obtained results with the results of the four tools presented in \[[1](https://arxiv.org/html/2605.11862#bib.bib47)\] for the ‘person’ category is shown in Table [3](https://arxiv.org/html/2605.11862#S3.T3)\.

Table 3: Results considering PERSON\(\*\): Systems in \[[1](https://arxiv.org/html/2605.11862#bib.bib47)\] vs\. final LG

The LG achieves higher precision\. However, as expected, it has a lower recall as it identifies fewer types of NEs: only two subtypes of the ‘person’ category \(‘individual’ and ‘position’\) are recognized, whereas the other systems recognize eight subtypes\. We believe that adding rules to the LG to recognize other subtypes of the ‘person’ category could further increase the recall, improving the LG approach even more as compared to other tools\.

## 4 Conclusions

This paper presented the use of the Unitex concordance comparison tool as a computational aid in manual composition of LGs\. We used this tool for the composition of an LG to identify person names in texts written in Portuguese\. The same methodology can be applied to the construction of LGs for other purposes\.

Table [1](https://arxiv.org/html/2605.11862#S2.T1) was created by listing the main set\-theoretic relationships \(inclusion, intersection and disjunction\) that we could observe when analyzing concordance\-comparison files generated by Unitex\. Taking these relationships into account, we could produce a more compact and easily understandable grammar\. We could also observe that a concordance offers an overview of what an LG recognizes in a specific corpus, allowing ambiguities and false positives to be identified\.

The results of our final LG show its potential for NE extraction\. It performed better \(gain of 6 points in F\-measure\) than Rembrandt, the system with the best performance for the ‘person’ category in the Second HAREM, when evaluating the ‘person’ category, ‘individual’ subtype, for which it was created\.

As avenues for future work, we plan to apply the LG approach to other corpora of texts written in Portuguese, and to assess performance with a corpus not used in the construction of the LG\. Moreover, we may add rules for recognizing other types of NEs\. We also intend to study the feasibility of building elementary LGGs automatically or semi\-automatically from examples, with the goal of minimizing human effort during construction\. The concordance comparison tool presented in this article might facilitate the automation of decision\-making for this purpose\.

## References

- \[1\] D\. O\. F\. Amaral, E\. B\. Fonseca, L\. Lopes, and R\. Vieira \(2014\) Comparative analysis of Portuguese named entities recognition tools\. In Proceedings of the Ninth International Conference on Language Resources and Evaluation \(LREC’14\), N\. Calzolari \(Conference Chair\), K\. Choukri, T\. Declerck, H\. Loftsson, B\. Maegaard, J\. Mariani, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\), Reykjavik, Iceland, May 26\-31, pp\. 2554–2558\. ISBN 978\-2\-9517408\-8\-4\.
- \[2\] J\. Baptista \(1998\) A local grammar of proper nouns\. In Seminários de Linguística, Vol\. 2, pp\. 21–37\.
- \[3\] N\. Cardoso \(2008\) REMBRANDT \- reconhecimento de entidades mencionadas baseado em relações e análise detalhada do texto\. In C\. Mota and D\. Santos \(Eds\.\), Desafios na Avaliação Conjunta do Reconhecimento de Entidades Mencionadas, Vol\. 1, pp\. 195–211\.
- \[4\] M\. Gross \(1997\) The construction of local grammars\. In E\. Roche and Y\. Schabès \(Eds\.\), Finite\-state language processing, Language, Speech, and Communication, Cambridge, Mass\., pp\. 329–354\.
- \[5\] M\. Gross \(1999\) A Bootstrap Method for Constructing Local Grammars\. In Proceedings of the Symposium on Contemporary Mathematics, N\. Bokan \(Ed\.\), pp\. 229–250\.
- \[6\] Linguateca \(2018\) Accessed: 02/03/18\. [http://www.linguateca.pt/](http://www.linguateca.pt/)
- \[7\] C\. D\. Manning and H\. Schütze \(1999\) Foundations of statistical natural language processing\. MIT Press\.
- \[8\] C\. Mota and D\. Santos \(2008\) Desafios na Avaliação Conjunta do Reconhecimento de Entidades Mencionadas: O Segundo HAREM\. Linguateca\. ISBN 978\-989\-20\-1656\-6\. [https://www.linguateca.pt/LivroSegundoHAREM/](https://www.linguateca.pt/LivroSegundoHAREM/)
- \[9\] S\. Paumier \(2016\) Unitex 3\.1 user manual\. [http://unitexgramlab.org/releases/3.1/man/Unitex-GramLab-3.1-usermanual-en.pdf](http://unitexgramlab.org/releases/3.1/man/Unitex-GramLab-3.1-usermanual-en.pdf)
- \[10\] J\. P\. C\. Pirovani and E\. de Oliveira \(2015\) Extração de Nomes de Pessoas em Textos em Português: uma Abordagem Usando Gramáticas Locais\. In Computer on the Beach 2015, Florianópolis, SC, pp\. 1–10\.
- \[11\] J\. P\. C\. Pirovani and E\. de Oliveira \(2017\) CRF\+LG: A Hybrid Approach for the Portuguese Named Entity Recognition\. In International Conference on Intelligent Systems Design and Applications \(ISDA 2017\), Delhi, India\.
- \[12\] SAHARA \(2018\) Accessed: 02/03/2018\. [http://www.linguateca.pt/SAHARA/](http://www.linguateca.pt/SAHARA/)
- \[13\] D\. Santos and N\. Cardoso \(2007\) Reconhecimento de entidades mencionadas em português: documentação e actas do HAREM, a primeira avaliação conjunta na área\. Linguateca\. ISBN 978\-989\-20\-0731\-1\. [http://www.linguateca.pt/aval_conjunta/LivroHAREM/Livro-SantosCardoso2007.pdf](http://www.linguateca.pt/aval_conjunta/LivroHAREM/Livro-SantosCardoso2007.pdf)
- \[14\] K\. Shaalan \(2014\) A survey of Arabic named entity recognition and classification\. Computational Linguistics 40\(2\), pp\. 469–510\. [https://doi.org/10.1162/COLI_a_00178](https://doi.org/10.1162/COLI_a_00178)
- \[15\] Unitex \(2018\) Accessed: 02/03/2018\. [http://unitexgramlab.org/](http://unitexgramlab.org/)
