Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
Summary
This paper proposes a novel pipeline for multilingual coreference resolution that uses cycle-consistent machine translation from English to low-resource languages to generate training data, validated by back-translation and BERT similarity. Experiments on four low-resource languages show significant performance gains, enabling accurate coreference resolution where no prior corpora existed.
# Multilingual Coreference Resolution via Cycle-Consistent Machine Translation
Source: [https://arxiv.org/html/2606.05444](https://arxiv.org/html/2606.05444)
Adriana\-Valentina Costache∗, Eduard Poesina∗, Silviu\-Florin Gheorghe, Paul Irofti, Radu Tudor Ionescu⋄ Department of Computer Science, University of Bucharest, Romania ∗Equal contribution\.⋄raducu\.ionescu@gmail\.com
###### Abstract
Coreference resolution is a core NLP task, having a broad range of downstream applications, e\.g\. machine translation, question answering, document summarization, etc\. While the task is well\-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low\-resource ones\. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation \(MT\) from English to a target low\-resource language, to generate or expand training data\. To automatically validate the quality of the translated samples, we back\-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model\. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency\. Extensive experiments on four low\-resource languages show that our pipeline brings significant performance gains in coreference resolution\. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available\.
## 1 Introduction
Coreference resolution (CR) is a fundamental NLP task, which aims to identify all expressions in a text that refer to the same entity. The first attempts at solving the CR problem were heavily based on human-designed rules for the English language Hobbs ([1978](https://arxiv.org/html/2606.05444#bib.bib128)); Ng ([2005](https://arxiv.org/html/2606.05444#bib.bib131)); Ponzetto and Strube ([2006](https://arxiv.org/html/2606.05444#bib.bib130)); Raghunathan et al. ([2010](https://arxiv.org/html/2606.05444#bib.bib129)). Such methods are limited by the difficulty of drawing up a complete list of non-contradictory rules, and are exposed to problems associated with the statistical nature of language. The foundational work of Lee et al. ([2017](https://arxiv.org/html/2606.05444#bib.bib127)) set out to address CR with a fully trainable solution, without human-designed linguistic rules. The authors introduced the first end-to-end neural system, using a bidirectional LSTM to produce contextual span representations for joint mention detection and coreference linking in English. Deep models later benefited from the emergence of better neural encoders Joshi et al. ([2019](https://arxiv.org/html/2606.05444#bib.bib132)), such as BERT Devlin et al. ([2019](https://arxiv.org/html/2606.05444#bib.bib103)). While end-to-end models reach competitive results Kirstain et al. ([2021](https://arxiv.org/html/2606.05444#bib.bib125)); Xu and Choi ([2020](https://arxiv.org/html/2606.05444#bib.bib126)), they usually have many task-specific hyperparameters and are hard to tune, as stated by Zhang et al. ([2023](https://arxiv.org/html/2606.05444#bib.bib133)).
More recently, researchers introduced a new category of sequence-to-sequence solutions Urbizu et al. ([2020](https://arxiv.org/html/2606.05444#bib.bib148)); Liu et al. ([2022](https://arxiv.org/html/2606.05444#bib.bib134)); Bohnet et al. ([2023](https://arxiv.org/html/2606.05444#bib.bib135)); Straka ([2023](https://arxiv.org/html/2606.05444#bib.bib136)), aiming to generate text representations of entity clusters. Notably, CorPipe Straka ([2023](https://arxiv.org/html/2606.05444#bib.bib136)) won the CRAC 2023 shared task on multilingual coreference resolution, while CorPipeEnsemble ranked first in the CRAC 2025 (unconstrained) edition. Another direction of study is the use of zero-shot large language models (LLMs) via prompting. Le and Ritter ([2024](https://arxiv.org/html/2606.05444#bib.bib137)) found that, although the zero-shot performance of prompted LLMs is respectable, they still remain well below specialized state-of-the-art models, by 10-20% on benchmarks like CoNLL-2012/OntoNotes Pradhan et al. ([2012](https://arxiv.org/html/2606.05444#bib.bib142)). The CRAC 2025 results Novák et al. ([2025](https://arxiv.org/html/2606.05444#bib.bib138)) also indicate that zero-shot LLMs lag far behind specialized models, with a clear gap of about 13% in terms of F1.
These empirical observations underline the utility of task-specific datasets used to train and test specialized CR models. However, CR datasets in certain languages are small, outdated, or entirely missing. There are clear efforts to remedy this situation, e.g. the CRAC 2025 shared task Novák et al. ([2025](https://arxiv.org/html/2606.05444#bib.bib138)) describes CorefUD as a harmonized multilingual collection of 22 datasets in 17 languages. By contrast, for low-resource languages, such as Romanian, we did not find any CR datasets that can be used in the evaluation and training of specialized models. To make things worse, it is reasonable to expect that the zero-shot performance in such languages is even lower.
Figure 1: Overview of the proposed pipeline for coreference resolution. An LLM, namely Claude Sonnet 4.6 Anthropic ([2026](https://arxiv.org/html/2606.05444#bib.bib2)), is prompted to translate annotated samples from English to the target language and back. The cycle consistency of back-translations is estimated via BERTScore Zhang et al. ([2020](https://arxiv.org/html/2606.05444#bib.bib146)). Finally, the CR model, namely Maverick Martinelli et al. ([2024](https://arxiv.org/html/2606.05444#bib.bib1)), is trained on the target language, weighting the loss of each translated sample with $s^p$, where $s$ represents the BERTScore of the respective sample, and $p$ is a hyperparameter that controls the importance of cycle consistency. Best viewed in color.

To this end, we propose a novel CR framework that leverages existing English resources via machine translation (MT) to generate new training data in a target low-resource language. As illustrated in Figure [1](https://arxiv.org/html/2606.05444#S1.F1), we employ back-translation and assess the overlap between original and back-translated English samples, where the overlap is given by the cosine similarity computed in the embedding space of a pre-trained BERT model Devlin et al. ([2019](https://arxiv.org/html/2606.05444#bib.bib103)); Zhang et al. ([2020](https://arxiv.org/html/2606.05444#bib.bib146)). We conjecture that the utility of a translated data sample is proportional to its cycle consistency, i.e. the cosine similarity of its back-translation. Therefore, we integrate the cosine similarity between original and back-translated English samples into the loss function, to weight the importance of translated samples according to their cycle consistency.
To validate the proposed framework, we perform experiments across four low\-resource languages: French, Hungarian, Romanian and Russian\. While three of these languages have small\-scale publicly available CR datasets, there are no CR resources for Romanian\. The results confirm that our cycle\-consistent MT augmentation framework can significantly boost performance in CR across all four languages, in both training dataset expansion and training dataset generation scenarios\.
In summary, our contribution is threefold:
- We propose a novel CR framework based on MT to generate new training samples for low-resource languages, and modulate sample importance according to MT cycle consistency.
- We conduct comprehensive experiments across four low-resource languages, showing that the proposed framework can significantly boost CR performance.
- We manually curate a coreference resolution test set for Romanian, thus enabling the evaluation of CR systems for this low-resource language.
## 2 Method
Our model extends Maverick Martinelli et al. ([2024](https://arxiv.org/html/2606.05444#bib.bib1)) with three modifications, to make it suitable for CR in low-resource languages. First, we replace the English-only encoder DeBERTa-v3-large He et al. ([2023](https://arxiv.org/html/2606.05444#bib.bib140)) with mmBERT-base Marone et al. ([2025](https://arxiv.org/html/2606.05444#bib.bib141)), a multilingual encoder pre-trained on 200+ languages. In this way, a single model can be employed across multiple languages. Second, we separate training into two phases: (i) train the mention detector with the frozen encoder, and (ii) fine-tune the encoder and the coreference heads using gold mentions as input, isolating the linking signal from mention-detection noise. Third, we augment the bilinear coreference scorer Lee et al. ([2017](https://arxiv.org/html/2606.05444#bib.bib127)) with MT cycle consistency, providing a discriminative signal, independent of encoder representations.
Generating data via MT. The lack of large-scale CR resources in many languages motivates our MT-based augmentation strategy. We employ a highly capable LLM to perform MT, namely Claude Sonnet 4.6 Anthropic ([2026](https://arxiv.org/html/2606.05444#bib.bib2)). As illustrated in Figure [1](https://arxiv.org/html/2606.05444#S1.F1), each source (English) document is translated using Claude Sonnet Anthropic ([2026](https://arxiv.org/html/2606.05444#bib.bib2)) via zero-shot prompting (the exact prompt is specified in Table [A.1](https://arxiv.org/html/2606.05444#A1.SS1)). The prompt instructs the model to produce a fluent target-language translation, while preserving every $\llbracket k \rrbracket \ldots \langle /k \rangle$ span around the target-language equivalent of each English mention, thus maintaining all cluster identifiers $k \in \{1, 2, \ldots, K\}$, where $K$ is the number of entities.
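The marker-preservation requirement can be checked mechanically. The following is a minimal sketch, assuming the markers appear literally as `⟦k⟧` and `⟨/k⟩` in the text; the helper names `marker_counts` and `markers_preserved`, as well as the example sentences, are hypothetical and not taken from the paper.

```python
import re
from collections import Counter

# Assumed literal marker syntax: opening ⟦k⟧ and closing ⟨/k⟩ for cluster id k.
OPEN = re.compile(r"⟦(\d+)⟧")
CLOSE = re.compile(r"⟨/(\d+)⟩")

def marker_counts(text):
    """Count opening and closing markers per cluster identifier."""
    opens = Counter(int(k) for k in OPEN.findall(text))
    closes = Counter(int(k) for k in CLOSE.findall(text))
    return opens, closes

def markers_preserved(source, translation):
    """Return True if the translation's markers are balanced and each
    cluster identifier occurs as often as in the source."""
    src_open, _ = marker_counts(source)
    tgt_open, tgt_close = marker_counts(translation)
    return tgt_open == tgt_close and src_open == tgt_open

# Illustrative (made-up) English source and two Romanian translations.
source = "⟦1⟧Maria⟨/1⟩ said ⟦1⟧she⟨/1⟩ would call ⟦2⟧the doctor⟨/2⟩."
good = "⟦1⟧Maria⟨/1⟩ a spus că ⟦1⟧ea⟨/1⟩ îl va suna pe ⟦2⟧doctor⟨/2⟩."
bad = "⟦1⟧Maria⟨/1⟩ a spus că îl va suna pe ⟦2⟧doctor⟨/2⟩."

print(markers_preserved(source, good))
print(markers_preserved(source, bad))
```

A count-based check of this kind only detects dropped or unbalanced markers; it does not verify that a marker wraps the correct target-language words, which is the kind of misalignment the back-translation score is meant to capture.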
Back-translation quality scoring. Translation errors introduce noise into the projected annotations. To quantify this noise per document, we employ back-translation to complete the translation cycle: each target language translation is itself submitted to Claude Sonnet 4.6 with a symmetric prompt requesting translation back to English, while preserving all cluster markers. The back-translated English text is then compared with the original English source via BERTScore Zhang et al. ([2020](https://arxiv.org/html/2606.05444#bib.bib146)), yielding a per-document quality score $s \in [0,1]$. Our intuition is that a high-fidelity translation followed by a faithful back-translation recovers text semantically close to the source, whereas a translation that drops or misaligns mentions produces a divergent back-translation.
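To make the scoring step concrete, the sketch below implements the greedy cosine matching that underlies BERTScore, using hand-written toy vectors in place of real BERT token embeddings. It assumes the F1 variant of the score (the text does not state whether precision, recall, or F1 is used), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_f1(ref_vecs, cand_vecs):
    """BERTScore-style F1: each token embedding is matched to its most
    similar counterpart on the other side, then scores are averaged."""
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy "token embeddings": the original has two tokens, the faithful
# back-translation recovers both, the lossy one drops the second token.
original = [[1.0, 0.0], [0.0, 1.0]]
faithful = [[1.0, 0.0], [0.0, 1.0]]
lossy = [[1.0, 0.0]]

print(greedy_match_f1(original, faithful))
print(greedy_match_f1(original, lossy))
```

In this toy setting, dropping a token lowers recall and hence the F1, which mirrors the intuition that a translation losing mentions yields a divergent back-translation. In practice the embeddings would come from a pre-trained BERT model rather than being written by hand.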
Per-document loss weighting. Rather than applying a hard threshold to discard low-quality documents, which would lose potentially useful training data, we incorporate BERTScore directly into the training objective. Each document $D$ contributes with the following weight:

$$w_D = s_D^{\,p}, \tag{1}$$

where $s_D$ is the BERTScore between source and back-translated versions of document $D$, and $p \geq 0$ controls the strength of the penalty. Then, the weighted training objective becomes:

$$\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{D \in \mathcal{D}} w_D \cdot \mathcal{L}_D(\theta), \tag{2}$$

where $\mathcal{D}$ is the collection of translated documents (originally available in English), $\theta$ represents the parameters of the CR model, and $\mathcal{L}_D$ is the per-document loss for the current training phase. For training phase (i), $\mathcal{L}_D$ is the standard binary cross-entropy on mention start/end logits, while for phase (ii), it is the marginal log-likelihood over gold antecedents Lee et al. ([2017](https://arxiv.org/html/2606.05444#bib.bib127)).
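Eqs. (1) and (2) translate almost directly into code. The sketch below is a plain-Python illustration under the assumption that per-document losses have already been computed as floats; the function names and the numeric values are hypothetical, and a real training loop would apply the same weights to framework tensors instead.

```python
def cycle_weight(score, p):
    """Eq. (1): weight of a document with BERTScore `score` and exponent `p`."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if p < 0:
        raise ValueError("p must be non-negative")
    return score ** p

def weighted_objective(doc_losses, scores, p):
    """Eq. (2): mean of per-document losses, each scaled by its weight."""
    if len(doc_losses) != len(scores):
        raise ValueError("one score is required per document")
    total = sum(cycle_weight(s, p) * loss for loss, s in zip(doc_losses, scores))
    return total / len(doc_losses)

# Two illustrative documents with equal raw loss but different cycle consistency.
losses = [2.0, 2.0]
scores = [1.0, 0.5]

print(weighted_objective(losses, scores, p=0))  # weighting switched off
print(weighted_objective(losses, scores, p=3))  # low-consistency document damped
```

With `p=0` every weight equals 1 and the objective reduces to the unweighted mean, while larger `p` shrinks the contribution of the document with the lower score without discarding it.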
| Metric | Score | English Base | French Base | French +MT | French +$s^p$ | Hungarian Base | Hungarian +MT | Hungarian +$s^p$ | Romanian ZS | Romanian +MT | Romanian +$s^p$ | Russian Base | Russian +MT | Russian +$s^p$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MUC | P | 95.0 | 85.2 | 85.5 | 89.3 | 87.3 | 87.5 | 88.9 | 70.1 | 86.2 | 87.8 | 95.5 | 95.6 | 95.9 |
| MUC | R | 95.7 | 82.3 | 83.4 | 88.0 | 88.9 | 89.5 | 90.8 | 66.0 | 84.8 | 86.5 | 95.8 | 96.2 | 96.5 |
| MUC | F1 | 95.4 | 83.7 | 84.4 | 88.6 | 88.1 | 88.5 | 89.8 | 68.0 | 85.5 | 87.1 | 91.7 | 92.0 | 92.4 |
| B³ | P | 88.1 | 80.8 | 80.2 | 85.6 | 84.2 | 84.2 | 85.4 | 65.8 | 81.5 | 83.2 | 92.1 | 92.0 | 92.3 |
| B³ | R | 91.7 | 79.0 | 80.6 | 85.4 | 89.3 | 90.2 | 91.2 | 62.4 | 80.9 | 82.4 | 93.5 | 94.0 | 94.3 |
| B³ | F1 | 89.9 | 79.9 | 80.7 | 85.5 | 86.7 | 87.1 | 88.2 | 64.1 | 81.2 | 82.8 | 92.8 | 92.9 | 93.3 |
| CEAF-E | P | 92.7 | 78.1 | 78.6 | 83.7 | 87.8 | 88.2 | 89.1 | 62.8 | 80.1 | 81.9 | 92.6 | 92.8 | 93.1 |
| CEAF-E | R | 86.2 | 73.2 | 73.5 | 79.0 | 80.1 | 80.7 | 82.0 | 57.9 | 75.4 | 71.1 | 90.8 | 91.1 | 91.6 |
| CEAF-E | F1 | 89.4 | 75.6 | 76.0 | 81.3 | 83.8 | 84.2 | 85.4 | 60.2 | 77.7 | 79.4 | 91.7 | 92.0 | 92.4 |
| CoNLL | F1 | 91.6 | 79.7 | 80.4 | 85.1 | 86.2 | 86.5 | 87.8 | 64.1 | 81.5 | 83.1 | 93.4 | 93.6 | 94.0 |

Table 1: Coreference resolution results on four target languages (French, Hungarian, Romanian, Russian), measured with the official CoNLL-2012 scorer. Best score per language is highlighted in bold. Legend: Base – Maverick trained on original target language data; ZS – zero-shot LLM (when no original training data is available); +MT – Maverick trained with translated examples; +$s^p$ – Maverick trained with translated examples and cycle-consistent loss weighting. For reference, we report results on English with the Base model.
## 3 Experiments
Datasets. For French, we use the ANCOR corpus Muzerelle et al. ([2014](https://arxiv.org/html/2606.05444#bib.bib144)), which contains 530 transcripts of spontaneous spoken French drawn from interviews, conversations, and oral surveys. For Hungarian, we use SzegedKoref Vincze et al. ([2018](https://arxiv.org/html/2606.05444#bib.bib145)), a dataset comprising 320 short editorial and news documents annotated for nominal coreference. For Russian, we use RuCor Toldova et al. ([2014](https://arxiv.org/html/2606.05444#bib.bib151)), a corpus of 180 texts covering news, scientific articles, blog posts and fiction. For French, Hungarian and Russian, where the native gold data is small or domain-restricted, we supplement the within-language training data with LLM-translated OntoNotes 5.0 documents to expand both volume and domain diversity. For data augmentation via MT, we choose OntoNotes 5.0 Weischedel et al. ([2013](https://arxiv.org/html/2606.05444#bib.bib143)); Pradhan et al. ([2012](https://arxiv.org/html/2606.05444#bib.bib142)) as the source English corpus, as it spans a broad range of genres: newswire, broadcast news, broadcast conversation, magazine, web text, telephone speech, and biblical text. For Romanian, no publicly available coreference dataset exists. We therefore construct a Romanian dataset entirely from English documents drawn from OntoNotes 5.0 Weischedel et al. ([2013](https://arxiv.org/html/2606.05444#bib.bib143)); Pradhan et al. ([2012](https://arxiv.org/html/2606.05444#bib.bib142)). The documents are translated with Claude Sonnet 4.6, which is instructed to preserve annotations. Further, the corresponding Romanian test set is manually verified and corrected by a native Romanian speaker to ensure that translations, mention boundaries, and coreference links are correct.
Evaluation measures. Following Martinelli et al. ([2024](https://arxiv.org/html/2606.05444#bib.bib1)), we employ three evaluation metrics, namely MUC Vilain et al. ([1995](https://arxiv.org/html/2606.05444#bib.bib149)), B³ Bagga and Baldwin ([1998](https://arxiv.org/html/2606.05444#bib.bib152)), and CEAF-E ($\phi_4$-CEAF) Luo ([2005](https://arxiv.org/html/2606.05444#bib.bib150)). For each of them, we report the precision (P), recall (R), and F1 scores. We also report the CoNLL F1 score, which is defined as the average F1 score of the MUC, B³ and CEAF-E metrics.
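The aggregation described above is simple enough to sketch in a few lines. The function names below are hypothetical; the three F1 inputs to the CoNLL average are the English Base values reported in Table 1, while the precision and recall passed to `f1` are arbitrary illustrative numbers.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conll_f1(muc_f1, b3_f1, ceafe_f1):
    """CoNLL F1: unweighted average of the MUC, B3 and CEAF-E F1 scores."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3

# English Base F1 scores from Table 1 (MUC, B3, CEAF-E).
english_base = conll_f1(95.4, 89.9, 89.4)
print(round(english_base, 1))

# Illustrative precision/recall pair, not taken from the paper.
print(f1(80.0, 60.0))
```

The rounded average reproduces the CoNLL F1 listed for the English Base model in Table 1. The per-metric precision and recall themselves come from the official scorer, which this sketch does not reimplement.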
Hyperparameter setup. We train Maverick Martinelli et al. ([2024](https://arxiv.org/html/2606.05444#bib.bib1)) using AdamW, with a learning rate of $10^{-4}$ and a mini-batch size of $16$. We use a gradient clipping of $1.0$. We train using early stopping with a patience of $20$ epochs, and select the best model via CoNLL F1 on validation data. All other hyperparameters are left to their default values. Our pipeline introduces a single extra hyperparameter into Maverick Martinelli et al. ([2024](https://arxiv.org/html/2606.05444#bib.bib1)), namely the power $p$ in Eq. ([1](https://arxiv.org/html/2606.05444#S2.E1)). We tune $p$ on validation data for one language (French), considering values for $p \in \{0.5, 1, 2, 3\}$. The optimal value $p = 3$ is kept across all languages to avoid overfitting in hyperparameter space.
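The model-selection rule can be sketched as follows. The configuration keys and the function `select_best_epoch` are hypothetical names chosen for illustration, and the validation scores are made up; only the numeric settings (learning rate, batch size, clipping, patience) come from the text.

```python
# Settings stated in the text; the key names are illustrative assumptions.
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "batch_size": 16,
    "grad_clip": 1.0,
    "patience": 20,
}

def select_best_epoch(val_scores, patience):
    """Return (best_epoch, best_score), stopping once `patience`
    epochs have passed without a new best validation score."""
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score

# Made-up validation CoNLL F1 values per epoch.
val_f1 = [70.0, 72.0, 71.0, 71.5, 90.0]

# A short patience stops before the late improvement is ever seen.
print(select_best_epoch(val_f1, patience=2))
print(select_best_epoch(val_f1, patience=TRAIN_CONFIG["patience"]))
```

The toy run shows the trade-off behind a long patience: with a patience of 2 the loop stops at epoch 3 and keeps the checkpoint from epoch 1, whereas the patience of 20 used in the paper lets training continue to the later, better epoch.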
Figure 2: Ablation of hyperparameter $p$, which controls the impact of loss weighting in Eq. ([2](https://arxiv.org/html/2606.05444#S2.E2)). Best viewed in color.

Results. For French, Hungarian and Russian, we compare three alternatives: a base Maverick model (trained on original within-language data), a Maverick model that benefits from MT data (trained on both original and translated data), and a Maverick model that benefits from MT data, but penalizes MT samples according to their cycle consistency (back-translation quality). For Romanian, there is no within-language data available, so we replace the base Maverick model with a zero-shot LLM (we use the same LLM as for MT, namely Claude Sonnet 4.6). In Table [1](https://arxiv.org/html/2606.05444#S2.T1), we present comparative results across four low-resource languages. The results indicate that MT-based data augmentation is beneficial, especially for Romanian, where there are no available coreference resolution corpora. Furthermore, we observe additional performance gains when introducing cycle-consistent loss weighting. Here, the improvements stem primarily from higher precision on MUC and B³, suggesting that loss weighting based on $s^p$ helps suppress spurious mentions introduced by translation artifacts.
Ablation of loss weighting hyperparameter. In Figure [2](https://arxiv.org/html/2606.05444#S3.F2), we vary the hyperparameter $p$ in Eq. ([1](https://arxiv.org/html/2606.05444#S2.E1)), considering values in the set $\{0, 0.5, 1, 2, 3\}$. Note that $p = 0$ sets every weight to 1, which turns off the BERTScore weighting in Eq. ([2](https://arxiv.org/html/2606.05444#S2.E2)). The ablation of $p$ is performed on the French dataset. The results show that higher values of $p$ lead to better results, confirming the benefit of our cycle-consistent loss weighting. However, going from $p = 2$ to $p = 3$, we observe that the performance gains begin to saturate.
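As a hedged illustration of how $p$ sharpens the weighting in Eq. (1), consider two hypothetical documents with BERTScores $s_A = 0.95$ and $s_B = 0.80$ (values chosen for illustration, not taken from the experiments). The ratio of their weights grows with $p$:

$$\frac{w_A}{w_B} = \left(\frac{s_A}{s_B}\right)^{p} = 1.1875^{\,p}, \qquad p = 0:\ 1, \quad p = 1:\ 1.1875, \quad p = 2:\ \approx 1.41, \quad p = 3:\ \approx 1.67.$$

At $p = 3$, the individual weights are $0.95^3 \approx 0.857$ and $0.80^3 = 0.512$, so the less consistent document contributes roughly 60% as much as the more consistent one, compared with about 84% at $p = 1$.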
Table 2: BLEU vs. BERTScore comparison (in terms of CoNLL F1), as alternatives for the semantic similarity score $s$ used in our cycle-consistent loss weighting, across all four target languages.

BERTScore vs. BLEU. In Table [2](https://arxiv.org/html/2606.05444#S3.T2), we compare two alternatives to measure MT cycle consistency, namely BLEU Papineni et al. ([2002](https://arxiv.org/html/2606.05444#bib.bib153)) and BERTScore Zhang et al. ([2020](https://arxiv.org/html/2606.05444#bib.bib146)). While both BLEU and BERTScore bring visible performance boosts across all four languages, BERTScore consistently outperforms BLEU. This is likely because BLEU relies on n-gram overlap and therefore does not always capture semantic relations, such as synonymy.
## 4 Conclusion
We proposed a novel pipeline for coreference resolution in low-resource languages, which harnesses MT to augment existing datasets or generate new training data (for languages where CR resources were not previously available). We assessed MT cycle consistency and introduced it in the loss function of the CR model to modulate the importance of translated data samples accordingly. To validate our approach, we conducted CR experiments across four low-resource languages. Our results demonstrated that our pipeline leads to significant performance gains, and even enables CR in languages without existing resources. In future work, we aim to expand the list of low-resource languages.
## Acknowledgments
This research is supported by the project “Romanian Hub for Artificial Intelligence \- HRIA”, Smart Growth, Digitization and Financial Instruments Program, 2021\-2027, MySMIS no\. 351416\. This work is also supported by a grant of the Ministry of Research, Innovation and Digitization, CCCDI \- UEFISCDI, project number PN\-IV\-P6\-6\.3\-SOL\-2024\-0090, within PNCDI IV\.
## 5 Limitations
Our framework relies on a highly capable LLM in the translation phase. While translated data is central to our framework, as it brings significant performance gains, LLM usage is also a downside, introducing some limitations, as detailed below.
First, LLMs are typically power-hungry models, having potentially negative effects on the environment due to their high energy consumption. As humanity gradually moves towards green energy production alternatives, the importance of the energy consumption problem of LLMs is expected to diminish. Moreover, we highlight that the translated data is meant to be reused multiple times to train and validate lighter models for coreference resolution. Hence, we limit LLM usage to the MT step, and refrain from fine-tuning LLMs for coreference resolution.
Second, potential biases of the LLM may eventually be transferred into the translated data, and later be inherited by the smaller coreference resolution model\. We have manually inspected the translated examples and did not observe any age, gender, racial, or other kinds of biases\. Nevertheless, our careful inspection does not completely exclude this possibility, especially for document genres and languages that are not included in our study\.
## References
- Anthropic (2026) Claude Sonnet 4.6 model card. Technical Report, Anthropic. External Links: [Link](https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf).
- A. Bagga and B. Baldwin (1998) Entity-based cross-document coreferencing using the vector space model. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL-COLING), pp. 79–85. External Links: [Link](https://doi.org/10.3115/980845.980859), [Document](https://dx.doi.org/10.3115/980845.980859).
- B. Bohnet, C. Alberti, and M. Collins (2023) Coreference resolution through a seq2seq transition-based system. Transactions of the Association for Computational Linguistics 11, pp. 212–226. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00543).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186. External Links: [Link](https://aclanthology.org/N19-1423/), [Document](https://dx.doi.org/10.18653/v1/N19-1423).
- P. He, J. Gao, and W. Chen (2023) DeBERTaV3: improving DeBERTa using ELECTRA-Style pre-training with gradient-disentangled embedding sharing. In Proceedings of International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/forum?id=sE7-XhLxHA).
- J. R. Hobbs (1978) Resolving pronoun references. Lingua 44(4), pp. 311–338. External Links: [Document](https://dx.doi.org/https%3A//doi.org/10.1016/0024-3841%2878%2990006-2).
- M. Joshi, O. Levy, L. Zettlemoyer, and D. S. Weld (2019) BERT for coreference resolution: baselines and analysis. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5802–5807. External Links: [Link](https://aclanthology.org/D19-1588/).
- Y. Kirstain, O. Ram, and O. Levy (2021) Coreference resolution without span representations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 14–19. External Links: [Link](https://aclanthology.org/2021.acl-short.3/), [Document](https://dx.doi.org/10.18653/v1/2021.acl-short.3).
- N. T. Le and A. Ritter (2024) Are large language models robust coreference resolvers? In Proceedings of Conference on Language Modeling (COLM). External Links: [Link](https://openreview.net/pdf?id=MmBQSNHKUl).
- K. Lee, L. He, M. Lewis, and L. Zettlemoyer (2017) End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 188–197. External Links: [Document](https://dx.doi.org/10.18653/v1/D17-1018).
- T. Liu, Y. E. Jiang, N. Monath, R. Cotterell, and M. Sachan (2022) Autoregressive structured prediction with language models. In Findings of the Association for Computational Linguistics: EMNLP, pp. 993–1005. External Links: [Link](https://aclanthology.org/2022.findings-emnlp.70/), [Document](https://dx.doi.org/10.18653/v1/2022.findings-emnlp.70).
- X. Luo (2005) On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT-EMNLP), pp. 25–32. External Links: [Link](https://aclanthology.org/H05-1004/).
- M. Marone, O. Weller, W. Fleshman, E. Yang, D. Lawrie, and B. Van Durme (2025) mmBERT: A modern multilingual encoder with annealed language learning. arXiv preprint arXiv:2509.06888. External Links: [Link](https://arxiv.org/abs/2509.06888).
- G. Martinelli, E. Barba, and R. Navigli (2024) Maverick: efficient and accurate coreference resolution defying recent trends. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 13380–13394. External Links: [Link](https://aclanthology.org/2024.acl-long.722/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.722).
- J. Muzerelle, A. Lefeuvre, E. Schang, J. Antoine, A. Pelletier, D. Maurel, I. Eshkol, and J. Villaneau (2014) ANCOR_Centre, a large free spoken French coreference corpus: description of the resource and reliability measures. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pp. 843–847. External Links: [Link](https://aclanthology.org/L14-1169/).
- V. Ng (2005) Supervised ranking for pronoun resolution: some recent improvements. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pp. 1081–1086. External Links: [Link](https://cdn.aaai.org/AAAI/2005/AAAI05-171.pdf).
- M. Novák, M. Konopik, A. Nedoluzhko, M. Popel, O. Prazak, J. Sido, M. Straka, Z. Žabokrtský, and D. Zeman (2025) Findings of the fourth shared task on multilingual coreference resolution: can LLMs dethrone traditional approaches? In Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC), pp. 95–118. External Links: [Link](https://aclanthology.org/2025.crac-1.9/), [Document](https://dx.doi.org/10.18653/v1/2025.crac-1.9).
- K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318. External Links: [Link](https://aclanthology.org/P02-1040/), [Document](https://dx.doi.org/10.3115/1073083.1073135).
- S. P. Ponzetto and M. Strube (2006) Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of the Human Language Technology Conference of the NAACL (NAACL-HLT), pp. 192–199. External Links: [Link](https://aclanthology.org/N06-1025/).
- S. Pradhan, A. Moschitti, N. Xue, O. Uryupina, and Y. Zhang (2012) CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Proceedings of Joint Conference on EMNLP and CoNLL - Shared Task, pp. 1–40. External Links: [Link](https://aclanthology.org/W12-4501/).
- K. Raghunathan, H. Lee, S. Rangarajan, N. Chambers, M. Surdeanu, D. Jurafsky, and C. D. Manning (2010) A multi-pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 492–501. External Links: [Link](https://aclanthology.org/D10-1048/).
- M. Straka (2023) ÚFAL CorPipe at CRAC 2023: larger context improves multilingual coreference resolution. In Proceedings of the CRAC 2023 Shared Task on Multilingual Coreference Resolution (CRAC), Z. Žabokrtský and M. Ogrodniczuk (Eds.), pp. 41–51. External Links: [Link](https://aclanthology.org/2023.crac-sharedtask.4/), [Document](https://dx.doi.org/10.18653/v1/2023.crac-sharedtask.4).
- S. Toldova, A. Roytberg, A.A. Ladygina, M.D. Vasilyeva, I.L. Azerkovich, M. Kurzukov, G. Sim, D.V. Gorshkov, A. Ivanova, A. Nedoluzhko, and Y. Grishina (2014) RU-EVAL-2014: Evaluating anaphora and coreference resolution for Russian. In Proceedings of the Annual International Conference on Computational Linguistics and Intellectual Technologies (Dialogue), Vol. 13, pp. 681–694. External Links: [Link](https://dialogue-conf.org/digests/dialog2014/materials/pdf/ToldovaSJu.pdf).
- G. Urbizu, A. Soraluze, and O. Arregi (2020) Sequence to sequence coreference resolution. In Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC), pp. 39–46. External Links: [Link](https://aclanthology.org/2020.crac-1.5).
- M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman (1995) A model-theoretic coreference scoring scheme. In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, pp. 45–52. External Links: [Link](https://aclanthology.org/M95-1005/).
- V. Vincze, K. Hegedűs, A. Sliz-Nagy, and R. Farkas (2018) SzegedKoref: a Hungarian coreference corpus. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC), pp. 401–405. External Links: [Link](http://www.lrec-conf.org/proceedings/lrec2018/pdf/325.pdf).
- R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, et al. (2013) OntoNotes Release 5.0. Technical report, Linguistic Data Consortium. External Links: [Link](https://catalog.ldc.upenn.edu/LDC2013T19).
- L. Xu and J. D. Choi (2020) Revealing the myth of higher-order inference in coreference resolution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8527–8533. External Links: [Link](https://aclanthology.org/2020.emnlp-main.686/), [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.686).
- T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. In Proceedings of International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/pdf?id=SkeHuCVFDr).
- W. Zhang, S. Wiseman, and K. Stratos (2023) Seq2seq is all you need for coreference resolution. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), H. Bouamor, J. Pino, and K. Bali (Eds.), pp. 11493–11504. External Links: [Link](https://aclanthology.org/2023.emnlp-main.704/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.704).
## Appendix A Appendix
### A\.1 Translation Prompt
To translate English documents annotated with entity clusters for coreference resolution, we employ Claude Sonnet 4\.6 Anthropic \([2026](https://arxiv.org/html/2606.05444#bib.bib2)\)\. The generic prompt used during translation from English to a target language <lang\> is given in Table [3](https://arxiv.org/html/2606.05444#A1.SS1)\. In the prompt template, <lang\> is replaced with one of the target languages, namely French, Hungarian, Romanian and Russian\. To translate documents back to English, we use a symmetric prompt\. The prompt comprises precise rules, especially regarding the preservation of annotations, which are particularly important for the underlying CR task\. We also illustrate the rules with an input\-output example, to further explain to the LLM how the translation should be performed\.
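Because the prompt in Table 3 fixes a simple inline marker format (`[[N]]` opens a mention of cluster N and `[/N]` closes it), the annotation-preservation rules can be checked automatically on each translated document. The following is a minimal sketch of such a check, not the authors' implementation: the function names `mention_counts` and `check_translation` are hypothetical, and the assumption that each cluster must keep exactly the same number of mentions is one reading of rules 2, 4 and 6 of the prompt.

```python
import re
from collections import Counter

# Opening marker [[N]] or closing marker [/N], following the prompt's format.
MARKER = re.compile(r"\[\[(\d+)\]\]|\[/(\d+)\]")


def mention_counts(text):
    """Count mentions per cluster ID; raise ValueError on unbalanced or crossing markers."""
    stack = []          # cluster IDs of currently open mentions
    counts = Counter()  # cluster ID -> number of closed mentions
    for match in MARKER.finditer(text):
        if match.group(1) is not None:
            stack.append(int(match.group(1)))
        else:
            closing = int(match.group(2))
            if not stack or stack[-1] != closing:
                raise ValueError(f"unmatched closing marker [/{closing}]")
            stack.pop()
            counts[closing] += 1
    if stack:
        raise ValueError(f"unclosed markers for clusters {stack}")
    return counts


def check_translation(source, translation):
    """Return a list of rule violations (empty if the annotations are preserved)."""
    src = mention_counts(source)
    try:
        tgt = mention_counts(translation)
    except ValueError as err:
        return [str(err)]
    problems = []
    for cid in sorted(set(src) | set(tgt)):
        if cid not in src:
            problems.append(f"new cluster ID {cid}")
        elif src[cid] != tgt[cid]:
            problems.append(
                f"cluster {cid}: {src[cid]} source mentions, {tgt[cid]} translated"
            )
    return problems


# The example pair from the prompt itself.
source = ("[[1]]John[/1] went to [[2]]the store[/2]. [[1]]He[/1] bought "
          "[[3]]milk[/3], and [[1]]John[/1] told [[4]]his wife[/4] about [[3]]it[/3].")
translation = ("[[1]]Ion[/1] s-a dus [[2]]la magazin[/2]. [[1]]El[/1] a cumpărat "
               "[[3]]lapte[/3], iar [[1]]Ion[/1] i-a spus [[4]]soţiei sale[/4] "
               "despre [[3]]asta[/3].")
print(check_translation(source, translation))
```

A check of this kind only verifies the structure of the annotations; it says nothing about whether the translated mention text is correct, which is what the back\-translation similarity is meant to capture.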
You are translating English text to <lang\> while PRESERVING coreference cluster annotations\.

The input text contains inline coreference markers:

\-\- \[\[N\]\]word\[/N\] marks a mention belonging to cluster N \(an integer\)

\-\- All mentions of the SAME entity share the SAME cluster ID

\-\- Markers can be nested: \[\[1\]\]the CEO of \[\[2\]\]Acme\[/2\]\[/1\]

YOUR TASK:

1\. Translate the entire text to fluent, natural <lang\>\.

2\. CRITICAL: every English mention \[\[N\]\]\.\.\.\[/N\] MUST appear in the <lang\> translation with the SAME cluster ID N, wrapping the <lang\> equivalent of that mention\.

3\. Pronouns count as mentions\. If "he" appears with ID 5 in English, the <lang\> equivalent pronoun \(or whichever inflected form fits\) MUST also be marked \[\[5\]\]\.\.\.\[/5\]\.

4\. If a mention is dropped because <lang\> doesn’t express it overtly \(e\.g\. pro\-drop subject\), still emit empty markers \[\[N\]\]\[/N\] at the dropped position to preserve the cluster\.

5\. Keep brackets balanced and properly nested\. Every \[\[N\]\] MUST have a matching \[/N\]\.

6\. Do NOT introduce new cluster IDs\. Use only the IDs present in the English text\.

7\. Output ONLY the <lang\> text with annotations\. No explanations, no preamble, no markdown fences, just the annotated translation\.

EXAMPLE INPUT: \[\[1\]\]John\[/1\] went to \[\[2\]\]the store\[/2\]\. \[\[1\]\]He\[/1\] bought \[\[3\]\]milk\[/3\], and \[\[1\]\]John\[/1\] told \[\[4\]\]his wife\[/4\] about \[\[3\]\]it\[/3\]\.

EXAMPLE OUTPUT: \[\[1\]\]Ion\[/1\] s\-a dus \[\[2\]\]la magazin\[/2\]\. \[\[1\]\]El\[/1\] a cumpărat \[\[3\]\]lapte\[/3\], iar \[\[1\]\]Ion\[/1\] i\-a spus \[\[4\]\]soţiei sale\[/4\] despre \[\[3\]\]asta\[/3\]\.

NOW TRANSLATE THIS TEXT \(output the <lang\> translation with annotations only\): \{english\_text\}
Table 3: Prompt used for Claude Sonnet 4\.6 Anthropic \([2026](https://arxiv.org/html/2606.05444#bib.bib2)\) to translate documents annotated with coreference resolution clusters from English to a low\-resource language <lang\>, where <lang\> is one of the following four languages: French, Hungarian, Romanian, Russian\. The output example is shown for Romanian\.

### A\.2 Compute Environment
We perform all our experiments in an academic compute environment, namely a workstation with a single NVIDIA GeForce RTX 3090 GPU with 24 GB of VRAM\. The reported results are averages over three runs\.
### A\.3 Romanian Data Annotation
The annotator employed to verify and correct the English→Romanian translations is an adult who holds a master's degree from a university located in Romania\. The recruited annotator willingly agreed to take part in the annotation process, after accepting our terms and conditions\. The authors provided accurate and complete instructions regarding the annotation task\. The annotator was also given the LLM prompt\. A fair compensation \(25 EUR per hour\) was paid to the annotator upon completion of the annotations\. This is almost double the average wage in Romania \(13\.6 EUR per hour; source: [https://www\.romania\-insider\.com/eurostat\-romanians\-working\-hours\-salaries\-april\-2026](https://www.romania-insider.com/eurostat-romanians-working-hours-salaries-april-2026)\)\. The authors verified the manual annotations to confirm that the annotator completed the task carefully and according to the provided instructions\.
### A\.4 Romanian Data License Agreement
The Romanian version of OntoNotes 5\.0 will be released under the LDC User Agreement for Non\-Members\.