Article published in: Corpus studies of language through time: Special issue of the International Journal of Corpus Linguistics 27:4 (2022)
Edited by Tony McEnery, Gavin Brookes and Isobelle Clarke
International Journal of Corpus Linguistics 27:4 (2022), pp. 529–553
Strategies in tracing linguistic variation in a corpus of Old Irish texts (CorPH)
Available under the Creative Commons Attribution (CC BY) 4.0 license.
For any use beyond this license, please contact the publisher at rights@benjamins.nl.
This article was made Open Access under a CC BY 4.0 license through payment of an APC by or on behalf of the authors.
Published online: 20 September 2022
https://doi.org/10.1075/ijcl.22018.sti
Abstract
This article introduces Corpus PalaeoHibernicum (CorPH), a corpus currently consisting of 78 texts in Early Irish
(c. 7th–10th cent.). It was created by the ERC-funded Chronologicon Hibernicum (ChronHib) project by
bringing together pre-existing lexical and syntactic databases and adding further crucial texts from the period. In addition to
being annotated for POS, morphological and syntactic information, CorPH carries a further, newly developed layer of annotation:
‘Variation Tagging’, i.e. a tagset that numerically encodes synchronic language variation during the Early Irish period, thus
allowing for much improved research on chronological variation in the material. A second new pillar of the study of linguistic
variation is Bayesian Language Variation Analysis (BLaVA), which addresses the challenge that “not-so-big data” poses to
statistical corpus methods. Instead of reflecting feature frequencies, BLaVA models language variation as probabilities of
variation.
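
The contrast between feature frequencies and probabilities of variation can be illustrated with a minimal sketch. It is not the BLaVA model itself, whose specification is not given in the abstract; it assumes a simple Beta-Binomial setup with a uniform prior, and the text names and counts are hypothetical. Two texts with the same raw proportion of an innovative variant receive different posterior estimates, because the smaller text provides less evidence.

```python
from math import sqrt


def beta_posterior(innovative, conservative, prior_a=1.0, prior_b=1.0):
    """Beta posterior parameters for the probability of the innovative variant.

    Assumption: a Beta(prior_a, prior_b) prior with a binomial likelihood.
    This is a textbook conjugate model, not the model used in the article.
    """
    if innovative < 0 or conservative < 0:
        raise ValueError("counts must be non-negative")
    return prior_a + innovative, prior_b + conservative


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(variance)


# Hypothetical counts of (innovative, conservative) variants per text.
texts = {"text_A": (3, 1), "text_B": (30, 10)}

for name, (innov, cons) in texts.items():
    a, b = beta_posterior(innov, cons)
    mean, sd = beta_mean_sd(a, b)
    raw = innov / (innov + cons)
    print(f"{name}: raw frequency={raw:.2f}, posterior mean={mean:.2f}, sd={sd:.2f}")
```

Both hypothetical texts have a raw frequency of 0.75, but the posterior for the shorter text is wider and pulled more strongly towards the prior, which is the kind of behaviour that makes a probabilistic treatment attractive for “not-so-big data”.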
Article outline
- 1. Introduction
- 2. Characteristics of Old Irish
- 3. The corpus
- 4. Corphusator
- 5. Variation tagging
- 6. Bayesian language variation analysis
- 7. Advantages and benefits of the methods
- 8. Challenges and desiderata
- Acknowledgements
- References
