Natural Language Processing for Ancient Greek: Design, advantages and challenges of language models

Stopponi, Silvia; Pedrazzini, Nilo; Peels-Matthey, Saskia; McGillivray, Barbara; Nissim, Malvina

doi:10.1075/dia.23013.sto

Article published in: Demystifying New Methods in Historical Linguistics
Edited by Erich Round
[Diachronica 41:3] 2024
pp. 414–435

Silvia Stopponi | University of Groningen

Nilo Pedrazzini | The Alan Turing Institute

Saskia Peels-Matthey | University of Groningen

Barbara McGillivray | King’s College London

Malvina Nissim | University of Groningen

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Open Access publication of this article was funded through a Transformative Agreement with University of Groningen.

Published online: 2 July 2024

Abstract

Computational methods have produced meaningful and usable results to study word semantics, including semantic change. These methods, belonging to the field of Natural Language Processing, have recently been applied to ancient languages; in particular, language modelling has been applied to Ancient Greek, the language on which we focus. In this contribution we explain how vector representations can be computed from word co-occurrences in a corpus and can be used to locate words in a semantic space, and what kind of semantic information can be extracted from language models. We compare three different kinds of language models that can be used to study Ancient Greek semantics: a count-based model, a word embedding model and a syntactic embedding model; and we show examples of how the quality of their representations can be assessed. We highlight the advantages and potential of these methods, especially for the study of semantic change, together with their limitations.

Keywords: Ancient Greek, semantic change, computational linguistics, language models, Natural Language Processing, word embeddings, semantic space
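As a rough illustration of the count-based idea summarised in the abstract, the following minimal sketch builds word vectors from co-occurrence counts and compares them with cosine similarity. It is not the authors' implementation: the toy corpus is invented (transliterated lemmas chosen only for illustration), and the function names `cooccurrence_vectors` and `cosine` and the window size are assumptions made here.

```python
from collections import Counter, defaultdict
from math import sqrt

# Invented toy "corpus" of lemmatised sentences (transliterated).
# Assumption for illustration only; not drawn from any real corpus.
TOY_CORPUS = [
    ["theos", "thusia", "hieron"],
    ["thea", "thusia", "hieron"],
    ["hippos", "pedion", "trecho"],
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words found within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, target in enumerate(sentence):
            start = max(0, i - window)
            end = min(len(sentence), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][sentence[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = cooccurrence_vectors(TOY_CORPUS)
# Words sharing the same contexts end up close together in the space.
sim_close = cosine(vectors["theos"], vectors["thea"])
# Words with no shared contexts have similarity zero.
sim_far = cosine(vectors["theos"], vectors["hippos"])
```

In this toy setting, words that occur in identical contexts receive near-identical vectors, while words with disjoint contexts are orthogonal. Real count-based models typically add frequency weighting and dimensionality reduction, which this sketch omits.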

Résumé

Les méthodes computationnelles ont produit des résultats significatifs et utilisables pour étudier la sémantique des mots, y compris le changement sémantique. Ces méthodes, qui appartiennent au domaine du traitement automatique des langues, ont été appliquées récemment aux langues anciennes. Notamment, la modélisation du langage a été appliquée au grec ancien, la langue sur laquelle nous nous concentrons. Dans cette contribution on explique comment des vecteurs de mots peuvent être calculés à partir de cooccurrences dans un corpus et comment ils peuvent être utilisés pour localiser les mots dans un espace sémantique. On explique aussi quel type d’information peut être extrait des modèles de langage. On compare trois différents types de modèles de langue qui peuvent être utilisés pour étudier la sémantique du grec ancien: un modèle à décomptage (count-based), un modèle Word2vec (qui produit des plongements de mots, ‘word embeddings’) et des plongements de mots enrichis d’information syntactique. On présente des exemples montrant comment la qualité de ces représentations de mots peut être évaluée. On met en évidence les avantages et les potentialités de ces méthodes, notamment pour étudier le changement sémantique, ainsi que leurs limites.

Zusammenfassung

Es hat sich gezeigt, dass rechnerische Methoden aussagekräftige und nutzbare Ergebnisse für die Untersuchung der Wortsemantik, einschließlich semantischer Veränderungen, liefern. Diese Methoden, die zum Bereich des Natural Language Processing gehören, werden seit kurzem auf alte Sprachen angewendet, insbesondere auf das Altgriechische, das hier von Interesse ist. In diesem Beitrag erklären wir, wie Vektordarstellungen aus Wort-Kookkurrenzen in einem Korpus berechnet und zur Lokalisierung von Wörtern in einem semantischen Raum verwendet werden können und welche Art semantischer Informationen aus Sprachmodellen extrahiert werden können. Wir vergleichen drei verschiedene Arten von Sprachmodellen, die zur Untersuchung altgriechischer Semantik angewendet werden können: ein zählbasiertes Modell, ein Worteinbettungsmodell und ein syntaktisches Einbettungsmodell, und wir zeigen an Beispielen, wie die Qualität ihrer Darstellungen bewertet werden kann. Wir zeigen die Vorteile und das Potenzial dieser Methoden auf, insbesondere für die Untersuchung semantischer Veränderungen, beleuchten aber auch deren Grenzen.

Article outline

1. Introduction: Language modelling for Ancient Greek
2. Annotation and existing annotated corpora of Ancient Greek
3. Distributional spaces
4. Count-based models and word embeddings: Potential and limitations
5. Computational studies on semantic change in Ancient Greek
6. Evaluation of the performance of language models
7. Syntactic word embeddings
8. Conclusions
Acknowledgements
Author contributions
Notes
Abbreviations
References
