AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

arXiv cs.CL 06/16/26, 04:00 AM Papers

diachronic treebank greek indo-european nlp dependency-parsing open-source

Summary

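This paper introduces AthDGC, which the authors describe as, to their knowledge, the first openly licensed dependency-parsed treebank of Greek spanning eight diachronic periods. It also provides verse-level cross-alignment of the New Testament to four ancient Indo-European languages (Latin, Gothic, Old Church Slavonic, and Classical Armenian). Annotation uses Stanza, sentence-level alignment uses LaBSE, and word-level alignment uses multilingual-BERT.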

arXiv:2606.15510v1 Announce Type: new Abstract: AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: 10.5281/zenodo.20439182.

Cached at: 06/16/26, 11:49 AM

# AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels
Source: [https://arxiv.org/abs/2606.15510](https://arxiv.org/abs/2606.15510)
Authors: [Nikolaos Lavidas](https://arxiv.org/search/cs?searchtype=author&query=Lavidas,+N), [Kiki Nikiforidou](https://arxiv.org/search/cs?searchtype=author&query=Nikiforidou,+K), [Dag Haug](https://arxiv.org/search/cs?searchtype=author&query=Haug,+D), [Leonid Kulikov](https://arxiv.org/search/cs?searchtype=author&query=Kulikov,+L), [Vassiliki Geka](https://arxiv.org/search/cs?searchtype=author&query=Geka,+V), [Vassileios Symeonidis](https://arxiv.org/search/cs?searchtype=author&query=Symeonidis,+V), [Theodoros Michalareas](https://arxiv.org/search/cs?searchtype=author&query=Michalareas,+T), [Sofia Chionidi](https://arxiv.org/search/cs?searchtype=author&query=Chionidi,+S), [Anastasia Tsiropina](https://arxiv.org/search/cs?searchtype=author&query=Tsiropina,+A), [Eleni Plakoutsi](https://arxiv.org/search/cs?searchtype=author&query=Plakoutsi,+E), [Evangelos Argyropoulos](https://arxiv.org/search/cs?searchtype=author&query=Argyropoulos,+E)

[View PDF](https://arxiv.org/pdf/2606.15510)

> Abstract: AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. It is, to the best of our knowledge, the first openly licensed dependency-parsed treebank of Greek that spans eight diachronic periods, namely Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine-Greek reference set for the project. Annotation uses the Stanford Stanza PROIEL-trained workflow; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual-BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC. Quantitative scale, per-witness verse counts, and per-period annotated-row counts are reported in the v0.5 release notes, after the audit pass completes. Concept DOI: [https://doi.org/10.5281/zenodo.20439182](https://doi.org/10.5281/zenodo.20439182).
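As a rough illustration of what a single PROIEL-style schema makes possible, the sketch below reads a tiny hand-written XML fragment with Python's standard library and recovers the dependency structure of one sentence. The fragment is not taken from the AthDGC release: the element and attribute names (`sentence`, `token`, `form`, `head-id`, `relation`, `citation-part`) follow the PROIEL XML conventions as commonly documented, the annotation values are illustrative only, and the helper names `read_tokens` and `dependents_by_head` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written sample in the style of PROIEL XML (assumption: not AthDGC data;
# the tags and relations below are illustrative, not gold annotation).
SAMPLE_XML = """<proiel schema-version="2.0">
  <source id="sample" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ἐν" lemma="ἐν" part-of-speech="R-" head-id="3" relation="xobj" citation-part="JOHN 1.1"/>
        <token id="2" form="ἀρχῇ" lemma="ἀρχή" part-of-speech="Nb" head-id="1" relation="obl" citation-part="JOHN 1.1"/>
        <token id="3" form="ἦν" lemma="εἰμί" part-of-speech="V-" relation="pred" citation-part="JOHN 1.1"/>
        <token id="4" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="5" relation="aux" citation-part="JOHN 1.1"/>
        <token id="5" form="λόγος" lemma="λόγος" part-of-speech="Nb" head-id="3" relation="sub" citation-part="JOHN 1.1"/>
      </sentence>
    </div>
  </source>
</proiel>"""


def read_tokens(xml_text):
    """Return one list of token dicts per <sentence> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for sent in root.iter("sentence"):
        tokens = []
        for tok in sent.iter("token"):
            tokens.append({
                "id": tok.get("id"),
                "form": tok.get("form"),
                "lemma": tok.get("lemma"),
                "pos": tok.get("part-of-speech"),
                "head": tok.get("head-id"),      # None when the token is a root
                "relation": tok.get("relation"),
                "verse": tok.get("citation-part"),
            })
        sentences.append(tokens)
    return sentences


def dependents_by_head(tokens):
    """Map each head token id to the forms of its direct dependents."""
    children = defaultdict(list)
    for t in tokens:
        if t["head"] is not None:
            children[t["head"]].append(t["form"])
    return dict(children)


sentences = read_tokens(SAMPLE_XML)
tokens = sentences[0]
root_forms = [t["form"] for t in tokens if t["head"] is None]
children = dependents_by_head(tokens)

print("root:", root_forms)
for head_id, forms in children.items():
    print(head_id, "->", forms)
```

Because every token carries a verse reference in `citation-part`, the same reader could in principle group tokens by verse before any cross-language alignment step; how AthDGC actually does this is not specified in the abstract.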

## Submission history

From: Nikolaos Lavidas \[[view email](https://arxiv.org/show-email/218ff916/2606.15510)\] **\[v1\]** Sat, 13 Jun 2026 23:38:39 UTC (120 KB)

Similar Articles

A Reproducible Universal Dependencies-Style Pipeline for Katharevousa Greek Parliamentary Text

Meet UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies

Prague Dependency Treebank -- Consolidated 2.0: Enriching a Complex Annotation Scheme

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset
