AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels
Summary
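This paper introduces AthDGC, which the authors describe as, to the best of their knowledge, the first openly licensed dependency-parsed treebank of Greek spanning eight diachronic periods, from Archaic to Modern Greek. It aligns the New Testament at verse level with four ancient Indo-European languages (Latin, Gothic, Old Church Slavonic, and Classical Armenian), using Stanza for annotation, LaBSE for sentence-level alignment, and multilingual-BERT for word-level alignment.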
# AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels
Source: [https://arxiv.org/abs/2606.15510](https://arxiv.org/abs/2606.15510)
Authors: [Nikolaos Lavidas](https://arxiv.org/search/cs?searchtype=author&query=Lavidas,+N), [Kiki Nikiforidou](https://arxiv.org/search/cs?searchtype=author&query=Nikiforidou,+K), [Dag Haug](https://arxiv.org/search/cs?searchtype=author&query=Haug,+D), [Leonid Kulikov](https://arxiv.org/search/cs?searchtype=author&query=Kulikov,+L), [Vassiliki Geka](https://arxiv.org/search/cs?searchtype=author&query=Geka,+V), [Vassileios Symeonidis](https://arxiv.org/search/cs?searchtype=author&query=Symeonidis,+V), [Theodoros Michalareas](https://arxiv.org/search/cs?searchtype=author&query=Michalareas,+T), [Sofia Chionidi](https://arxiv.org/search/cs?searchtype=author&query=Chionidi,+S), [Anastasia Tsiropina](https://arxiv.org/search/cs?searchtype=author&query=Tsiropina,+A), [Eleni Plakoutsi](https://arxiv.org/search/cs?searchtype=author&query=Plakoutsi,+E), [Evangelos Argyropoulos](https://arxiv.org/search/cs?searchtype=author&query=Argyropoulos,+E)
[View PDF](https://arxiv.org/pdf/2606.15510)
> Abstract: AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: [https://doi.org/10.5281/zenodo.20439182](https://doi.org/10.5281/zenodo.20439182).
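
The abstract says sentence-level alignment relies on LaBSE embeddings. As a rough illustration of the general idea only, and not the AthDGC toolkit's actual code, the sketch below pairs each source unit with its most similar target unit by cosine similarity. The toy 3-dimensional vectors stand in for real sentence embeddings, and the function names, the unit ids, and the `threshold` parameter are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_by_similarity(src, tgt, threshold=0.5):
    """For each source unit, pick the most similar target unit.

    src, tgt: dicts mapping a unit id to its embedding vector.
    Returns {src_id: (tgt_id, score)}, omitting matches below `threshold`.
    Hypothetical helper: each source unit is matched independently, so two
    source units may map to the same target unit.
    """
    alignment = {}
    for src_id, src_vec in src.items():
        best_id, best_score = None, -1.0
        for tgt_id, tgt_vec in tgt.items():
            score = cosine(src_vec, tgt_vec)
            if score > best_score:
                best_id, best_score = tgt_id, score
        if best_id is not None and best_score >= threshold:
            alignment[src_id] = (best_id, best_score)
    return alignment


# Toy vectors (assumption): placeholders for embeddings of Greek and Latin verses.
greek = {"grc:MARK 1.1": [0.9, 0.1, 0.0], "grc:MARK 1.2": [0.1, 0.8, 0.3]}
latin = {"lat:MARK 1.1": [0.8, 0.2, 0.1], "lat:MARK 1.2": [0.0, 0.9, 0.2]}

pairs = align_by_similarity(greek, latin)
for src_id, (tgt_id, score) in pairs.items():
    print(f"{src_id} -> {tgt_id} ({score:.3f})")
```

In a real setting the dictionaries would hold vectors produced by an embedding model, and a one-to-one or order-preserving constraint might replace the independent per-unit maximum used here.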
## Submission history
From: Nikolaos Lavidas \[[view email](https://arxiv.org/show-email/218ff916/2606.15510)\]
**\[v1\]** Sat, 13 Jun 2026 23:38:39 UTC (120 KB)

## Similar Articles
A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text
This paper presents a reproducible pipeline for building Universal Dependencies-style parsing resources for Katharevousa Greek parliamentary text, including OCR reconstruction, LLM-assisted annotation, and evaluation of multiple parsers. The best model (XLM-R) achieves 0.8893 UPOS accuracy and 0.5162 LAS, significantly outperforming off-the-shelf baselines.
Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies
This paper introduces UD_Czech-PDTC, a large and genre-diverse treebank for Czech in the Universal Dependencies framework, derived from the Prague Dependency Treebank-Consolidated. It describes the conversion process and differences between annotation schemes.
Prague Dependency Treebank -- Consolidated 2.0: Enriching a Complex Annotation Scheme
We present the second consolidated version of the Prague Dependency Treebank, a 4-million-token manual multilingual annotation resource covering morphology, syntax, semantics, coreference, and discourse, along with compatible lexicons.
Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation
Georgetown researchers boost low-resource Coptic-to-English translation by augmenting in-context prompts with Universal Dependencies syntactic parses alongside bilingual glosses, setting a new state-of-the-art.
DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset
This paper introduces DraDDP, the first publicly available English multimodal dataset for multi-party dialogue discourse parsing, built from American TV dramas with 495 segments, 6,374 utterances, and 9.1 hours of video. Benchmarks show multimodal information improves parsing of dialogue structures and relation types.