Article published in: Demystifying New Methods in Historical Linguistics
Edited by Erich Round
[Diachronica 41:3] 2024
pp. 414–435
Natural Language Processing for Ancient Greek
Design, advantages and challenges of language models
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
Open Access publication of this article was funded through a Transformative Agreement with University of Groningen.
Published online: 2 July 2024
https://doi.org/10.1075/dia.23013.sto
Abstract
Computational methods have produced meaningful and usable results for the study of word semantics, including semantic
change. These methods, belonging to the field of Natural Language Processing, have recently been applied to ancient languages; in
particular, language modelling has been applied to Ancient Greek, the language on which we focus. In this contribution we explain
how vector representations can be computed from word co-occurrences in a corpus and can be used to locate words in a semantic space,
and what kind of semantic information can be extracted from language models. We compare three different kinds of language models
that can be used to study Ancient Greek semantics: a count-based model, a word embedding model and a syntactic embedding model;
and we show examples of how the quality of their representations can be assessed. We highlight the advantages and potential of
these methods, especially for the study of semantic change, together with their limitations.
Résumé
Computational methods have produced significant and usable results for studying
word semantics, including semantic change. These methods, which belong to the field of natural language processing,
have recently been applied to ancient languages. In particular, language modelling has been applied to Ancient Greek,
the language on which we focus. In this contribution we explain how word vectors can be
computed from co-occurrences in a corpus and how they can be used to locate words in a semantic
space. We also explain what kind of information can be extracted from language models. We compare three different types
of language models that can be used to study Ancient Greek semantics: a count-based model, a
Word2vec model (which produces word embeddings) and word embeddings enriched with syntactic
information. We present examples showing how the quality of these word representations can be evaluated. We
highlight the advantages and potential of these methods, especially for studying semantic change, as well as their
limitations.
Zusammenfassung
Computational methods have been shown to deliver meaningful and usable results for the
study of word semantics, including semantic change. These methods, which belong to the field of Natural
Language Processing, have recently been applied to ancient languages, in particular to Ancient Greek, which is of
interest here. In this contribution we explain how vector representations can be computed from word co-occurrences in a corpus and
used to locate words in a semantic space, and what kind of semantic information can be
extracted from language models. We compare three different kinds of language models that can be applied to the study
of Ancient Greek semantics: a count-based model, a word embedding model and a syntactic
embedding model, and we use examples to show how the quality of their representations can be assessed. We point out the
advantages and potential of these methods, especially for the study of semantic change, but also shed light on
their limits.
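The abstract states that vector representations can be computed from word co-occurrences in a corpus and used to locate words in a semantic space, without giving details at this level. The following is a minimal sketch of that general idea for a count-based model, not the authors' implementation. The toy lemmas are invented for illustration and are not corpus data; the function names `cooccurrence_vectors` and `cosine` and the window size of 2 are likewise assumptions.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words that occur within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy "corpus" of transliterated lemmas (an assumption, not real data).
toy_corpus = [
    ["hiereus", "thyo", "bous", "theos"],
    ["hiereus", "thyo", "ois", "theos"],
    ["nautes", "pleo", "naus", "thalassa"],
]

vectors = cooccurrence_vectors(toy_corpus, window=2)
sim_animals = cosine(vectors["bous"], vectors["ois"])  # identical contexts
sim_mixed = cosine(vectors["bous"], vectors["naus"])   # no shared contexts
```

In this toy setting the two lemmas that occur in the same contexts receive a similarity of approximately 1, while lemmas with no shared context words receive 0. Real count-based models typically add weighting and dimensionality reduction on top of raw counts, which this sketch leaves out.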
Article outline
- 1. Introduction: Language modelling for Ancient Greek
- 2. Annotation and existing annotated corpora of Ancient Greek
- 3. Distributional spaces
- 4. Count-based models and word embeddings: Potential and limitations
- 5. Computational studies on semantic change in Ancient Greek
- 6. Evaluation of the performance of language models
- 7. Syntactic word embeddings
- 8. Conclusions
- Acknowledgements
- Author contributions
- Notes
- Abbreviations
- References
