Statistical evidence for the Proto-Indo-European-Euskarian hypothesis: A word-list approach integrating phonotactics

Blevins, Juliette; Sproat, Richard

doi:10.1075/dia.19014.ble

Article published in: Diachronica
Vol. 38:4 (2021), pp. 506–564

Juliette Blevins | The Graduate Center, CUNY

Richard Sproat | Google (Japan)

Published online: 6 May 2021

Abstract

Based on a new reconstruction of Proto-Basque, and regular sound correspondences between this Proto-Basque and Proto-Indo-European as standardly reconstructed, Blevins (2018) argues that Proto-Basque and Proto-Indo-European have a common ancestor that pre-dates the two proto-languages. Part of this argument is based on proposed Proto-Indo-European/Proto-Basque cognate sets that include basic vocabulary items. In this study we offer statistical support for Blevins’ conclusions by using a Monte Carlo simulation that allows us to estimate the probability that the proposed lexical correspondences could have arisen by chance. The method makes use of phonotactic language models to generate possible words in a pair of languages, and then attempts to discover consistent correspondences between the words, producing a list of possible “cognates”. The method differs from some previous approaches by considering matches between all segments in the word pairs. By running such a simulation a large number of times, we can estimate the probability that two languages with the given phonotactics could have produced the number of cognate pairs observed in the actual data. The method is independently assessed by comparing wordlists from 100 pairs of languages, related and unrelated, where relations are known. Our conclusion is that the proposed correspondences are unlikely to have arisen by chance, supporting a distant relationship between Proto-Basque as reconstructed by Blevins (2018) and Proto-Indo-European.

Keywords: Proto-Indo-European-Euskarian hypothesis, Proto-Basque, statistical evidence, Monte Carlo simulation, phonotactics, long-distance relationships
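The simulation logic summarized in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: the segment-bigram model, the position-wise alignment, the `min_support` threshold, and every function name below are assumptions made for illustration, and the word lists in the usage section are invented placeholders rather than Proto-Basque or Proto-Indo-European data.

```python
import random
from collections import Counter, defaultdict

BOUNDARY = "#"  # hypothetical word-edge marker


def train_bigram(words):
    """Count segment bigrams, treating each character as one segment."""
    counts = defaultdict(Counter)
    for w in words:
        if not w:
            continue  # skip empty entries so '#' is never followed by '#'
        padded = BOUNDARY + w + BOUNDARY
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    return counts


def sample_word(model, rng, max_len=8):
    """Sample a word segment by segment until a boundary or max_len."""
    out, prev = [], BOUNDARY
    while len(out) < max_len:
        options = model[prev]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == BOUNDARY:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


def count_matches(pairs, min_support=2):
    """Toy cognate detector (an assumption, not the published procedure):
    a pair counts if both words have equal length and every position-wise
    segment correspondence occurs in at least min_support pairs."""
    support = Counter()
    for a, b in pairs:
        if len(a) == len(b):
            support.update(set(zip(a, b)))
    return sum(
        1
        for a, b in pairs
        if len(a) == len(b) and all(support[c] >= min_support for c in zip(a, b))
    )


def chance_probability(words_a, words_b, observed, n_meanings, n_runs=1000, seed=0):
    """Fraction of simulated word-list pairs with at least `observed` matches."""
    rng = random.Random(seed)
    model_a, model_b = train_bigram(words_a), train_bigram(words_b)
    hits = 0
    for _ in range(n_runs):
        pairs = [
            (sample_word(model_a, rng), sample_word(model_b, rng))
            for _ in range(n_meanings)
        ]
        if count_matches(pairs) >= observed:
            hits += 1
    return hits / n_runs


if __name__ == "__main__":
    # Invented placeholder word lists, for demonstration only.
    lang_a = ["pata", "kapi", "tika", "pika", "taki"]
    lang_b = ["bado", "gabi", "diga", "biga", "dagi"]
    p = chance_probability(lang_a, lang_b, observed=3, n_meanings=20, n_runs=200)
    print(f"estimated chance probability: {p:.3f}")
```

The returned value is a Monte Carlo estimate: the share of simulated runs in which random words drawn from the two phonotactic models yield at least as many "cognate" pairs as were observed. A small value suggests the observed count is unlikely under chance alone, given the simplifying assumptions of the sketch.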

Résumé

À partir d’une nouvelle reconstruction du proto-basque et de correspondances régulières entre les phonèmes de cette proto-langue et de la reconstruction établie du proto-indo-européen, Blevins (2018) soutient que le proto-basque et le proto-indo-européen descendent d’un ancêtre commun qui serait antérieur aux deux proto-langues. Dans cette étude, nous introduisons de nouveaux arguments statistiques venant à l’appui des conclusions de Blevins. Nous proposons une simulation de Monte Carlo qui nous permet d’estimer la probabilité d’une apparition par hasard des correspondances lexicales avancées par Blevins. Notre méthode s’appuie sur des modèles de langage phonotactiques. Elle génère d’abord un ensemble de mots possibles, puis tente de découvrir les correspondances cohérentes entre ces mots. Notre méthode diffère des approches précédentes en ce qu’elle considère d’emblée toutes les correspondances possibles entre tous les segments de toutes les paires de mots. L’exécution répétée un très grand nombre de fois d’une telle simulation permet d’estimer la probabilité selon laquelle deux langues auraient pu produire le nombre de paires apparentées observées dans les données existantes. Nous en concluons qu’il est invraisemblable que les correspondances proposées par Blevins aient pu apparaître par hasard.

Zusammenfassung

Basierend auf einer neuen Rekonstruktion des Proto-Baskischen und regelmäßigen Lautkorrespondenzen zwischen dieser Protosprache und dem proto-indogermanischen Rekonstruktionsstandard erörtert Blevins (2018), dass Proto-Baskisch und Proto-Indogermanisch von einer gemeinsamen Protosprache abstammen, die beiden Sprachen zeitlich vorausgeht. In dieser Studie stützen wir Blevins’ Schlussfolgerungen statistisch mittels einer Monte-Carlo-Simulation. Unsere Methode erlaubt es, die Wahrscheinlichkeit abzuschätzen, dass die vorgeschlagenen lexikalischen Entsprechungen zufällig hätten entstehen können. Die Methode beruht auf phonotaktischen Sprachmodellen, welche potentielle Wörter generieren und dann mögliche konsistente Entsprechungen zwischen diesen Wörtern erkennen. Unsere Methode unterscheidet sich dadurch von existierenden Ansätzen, dass sie Entsprechungen zwischen allen Segmenten der Wortpaare in Betracht zieht. Die mehrfache Wiederholung einer solchen Simulation erlaubt eine Einschätzung der Wahrscheinlichkeit, dass zwei Sprachen die Anzahl an Wortpaarentsprechungen erzeugen, die in den Daten beobachtet wird. Daraus schließen wir, dass die von Blevins vorgeschlagenen Korrespondenzen wahrscheinlich nicht zufällig entstanden sind.

Article outline

1. The Proto-Indo-European-Euskarian hypothesis
- 1.1 Eliminating potential loans
- 1.2 Comparing basic vocabulary
2. Cognate sets and semantic lumping
3. Computational methods for lexically based assessments of genetic relatedness
- 3.1 Computing probabilities of chance similarities
- 3.2 Earlier modeling of cognate detection, and the significance of cognates
- 3.3 A synopsis of our proposed method
4. The experiment
- 4.1 Language pairs and word lists
- 4.2 Language model training
- 4.3 Language models
- 4.4 Alignment and potential cognates
- 4.5 Simulations
5. Results
6. Verifying the method with a large list of language pairs
7. Concluding remarks
Acknowledgements
Notes
Abbreviations
References

References (77)

Albright, Adam. 2009. Feature-based generalisation as a source of gradient acceptability. Phonology 26: 9–41.

Ariztimuño, Borja, Eneko Zuloaga & Dorota Krajewska. 2019. Against the Proto-Indo-European hypothesis, or why Basque continues to be a language isolate. Talk presented at the Societas Linguistica Europaea 52nd Annual Meeting, Leipzig, Germany.

Blasi, Damián E., Søren Wichmann, Harald Hammarström, Peter F. Stadler & Morten H. Christiansen. 2016. Sound-meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences. 113(39): 10818–10823.

Blevins, Juliette. 2018. Advances in Proto-Basque reconstruction with evidence for the Proto-Indo-European-Euskarian hypothesis. London & New York: Routledge.

Blevins, Juliette. 2020. Derivational patterns in Proto-Basque word structure. In Pavel Stekauer & Lívia Körtvélyessy (eds.), The complexity of complex words, 222–243. Cambridge: Cambridge University Press.

Blust, Robert. 2013. The Austronesian languages. Canberra: Pacific Linguistics.

Blust, Robert & Stephen Trussel. ongoing. The Austronesian comparative dictionary. Revision 12/17/2016. [URL]

Buck, Carl Darling. 1949. A dictionary of selected synonyms in the principal Indo-European languages. Chicago: University of Chicago Press.

Campbell, Lyle & William J. Poser. 2008. Language classification: History and method. Cambridge: Cambridge University Press.

Covington, Michael. 1996. An algorithm to align words for historical comparison. Computational Linguistics 22: 481–496.

Derksen, Rick. 2008. Etymological dictionary of the Slavic inherited lexicon. Leiden: Brill.

Dockum, Rikker & Claire Bowern. 2019. Swadesh lists are not long enough: Drawing phonological generalizations from limited data. Language Documentation and Description 16: 35–54.

Dolgopolsky, Aharon. 1964. Гипотеза древнейшего родства языковых семей Северной Евразии с вероятностной точки зрения. [A probabilistic hypothesis concerning the oldest relationships among the languages of northern Eurasia.] Voprosy yazykoznaniya 2: 53–63.

Dunkel, George E. 2014. Lexikon der indogermanischen Partikeln und Pronominalstämme (2 vols.). Heidelberg: Universitätsverlag Winter.

Dunn, Michael & Angela Terrill. 2012. Assessing the evidence for a Central Solomons Papuan family using the Oswalt Monte Carlo Test. Diachronica 29: 1–27.

Egurtzegi, Ander. 2013. Phonetics and phonology. In Martínez-Areta, Mikel (ed.) 2013. Basque and Proto-Basque. Language-internal and typological approaches to linguistic reconstruction [Mikroglottika 5], 119–172. Frankfurt am Main: Peter Lang.

Egurtzegi, Ander. 2014. Towards a phonetically grounded diachronic phonology of Basque. PhD dissertation, Euskal Herriko Unibertsitatea.

Fortson, Benjamin W. IV. 2010. Indo-European language and culture: An introduction. Second edition. Oxford: Wiley-Blackwell.

François, Alexandre. 2008. Semantic maps and the typology of colexification: Intertwining polysemous networks across languages. In M. Vanhove (ed.), From polysemy to semantic change, 163–215. Amsterdam: John Benjamins.

Gamkrelidze, Thomas & Vjačeslav Ivanov. 1995. Indo-European and the Indo-Europeans. (Trans. Johanna Nichols), Berlin and New York: Mouton de Gruyter.

GCSE (no date) (9–1) Classical Greek J292/01 Language defined vocabulary list and restricted vocabulary list. [URL]

Goddard, Ives. 1975. Algonquian, Wiyot, and Yurok: Proving a distant genetic relationship. In M. Dale Kinade, Kenneth L. Hale & Oswald Werner (eds.), Linguistics and anthropology: In honor of C.F. Voegelin, 249–262. Lisse: Peter de Ridder Press.

Gorman, Kyle. 2016. Pynini: A Python library for weighted finite-state grammar compilation. In Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata, 75–80.

Gorrochategui, Joaquín. 1984. Estudio sobre la onomástica indígena de Aquitania. Bilbao: University of the Basque Country and University of Salamanca.

Haspelmath, Martin & Uri Tadmor (eds.). 2009. Loanwords in the world’s languages: A comparative handbook. Berlin: Mouton de Gruyter.

Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller & Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42: 331–354.

Jäger, Gerhard. 2013. Phylogenetic inference from word lists using weighted alignment with empirically determined weights. Language Dynamics and Change 3(2): 245–291.

Jäger, Gerhard, Johann-Mattis List & Pavel Sofroniev. 2017. Using support-vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists. In Proceedings of the European ACL 2017, 1205–1216.

Jansche, Martin. 2003. Inference of string mappings for speech technology. PhD dissertation, The Ohio State University.

Johansson, Niklas Erben, Andrey Anikin, Gerd Carling & Arthur Holmer. 2020. The typology of sound symbolism: Defining macro-concepts via their semantic and phonetic features. Linguistic Typology.

Jurafsky, Dan & James Martin. 2018. Speech and language processing. Third edition draft. [URL]

Kassian, Alexei, Mikhail Zhivlov & George Starostin. 2015. Proto-Indo-European-Uralic comparison from the probabilistic point of view. The Journal of Indo-European Studies. 43(3–4): 301–347.

Kessler, Brett. 2001. The significance of word lists: Statistical tests for investigating historical connections between languages. Stanford, CA: CSLI Publications. Distributed by The University of Chicago Press.

Kessler, Brett. 2015. Computational and quantitative approaches to historical phonology. In P. Honeybone & J. Salmons (eds.), The Oxford handbook of historical phonology, 133–148. Oxford: Oxford University Press.

Kloekhorst, Alwin. 2008. Etymological dictionary of the Hittite inherited lexicon. Amsterdam: Brill.

Kondrak, Grzegorz. 2000. A new algorithm for the alignment of phonetic sequences. In Proceedings of NAACL 2000, 288–295. San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Kondrak, Grzegorz. 2002. Algorithms for language reconstruction. PhD dissertation, University of Toronto.

Lakarra, Joseba, Julen Manterola & Iñaki Segurola (eds.). 2019. Euskal Hiztegi Historiko-Etimologikoa (EHHE-200). Bilbo: Euskaltzaindia.

Linguistics Research Center (no date). Ancient Sanskrit online. Sanskrit base form dictionary. Linguistics Research Center, University of Texas at Austin. [URL]

List, Johann-Mattis. 2012. SCA. Phonetic alignment based on sound classes. In M. Slavkovik & D. Lassiter (eds.), New directions in logic, language, and computation, 32–51. Berlin and Heidelberg: Springer.

List, Johann-Mattis. 2014. Sequence comparison in historical linguistics. Düsseldorf: Düsseldorf University Press.

List, Johann-Mattis & Steven Moran. 2013. An open source toolkit for quantitative historical linguistics. In Proceedings of the ACL 2013 System Demonstrations.

List, Johann-Mattis, Thomas Mayer, Anselm Terhalle & Matthias Urban (eds.). 2014. CLICS: Database of cross-linguistic colexifications. Marburg: Forschungszentrum Deutscher Sprachatlas. [URL]

List, Johann-Mattis, Simon Greenhill, Cormac Anderson, Thomas Mayer, Tiago Tresoldi & Robert Forkel (eds.). 2019. CLICS³. [accessed at [URL]]

List, Johann-Mattis, Mary Walworth, Simon Greenhill, Tiago Tresoldi & Robert Forkel. 2018. Sequence comparison in computational historical linguistics. Journal of Language Evolution 3(2): 130–144.

Martínez-Areta, Mikel (ed.) 2013. Basque and Proto-Basque. Language-internal and typological approaches to linguistic reconstruction [Mikroglottika 5]. Frankfurt am Main: Peter Lang.

Michelena, Luis. 1961. Fonética histórica vasca. First edition. Donostia-San Sebastián.

Michelena, Luis. 1977 [2011]. Fonética histórica vasca. 2nd edition. In J. A. Lakarra & I. Ruiz Arzalluz (eds.), Luis Michelena. Obras completas VI (Supplements of ASJU 59). Donostia-San Sebastián & Vitoria-Gasteiz: Diputación Foral de Guipuzcoa, University of the Basque Country.

Michelena, Luis & Ibon Sarasola. 1987–2005. Orotariko Euskal Hiztegia [OEH], [General Basque Dictionary]. 16 volumes. Bilbao: Euskaltzaindia. [updated online version accessed at [URL]]

Nichols, Johanna. 1996. The comparative method as heuristic. In M. Durie & M. Ross (eds.), The comparative method reviewed: Regularity and irregularity in language change, 39–71. Oxford: Oxford University Press.

Orel, Vladimir. 1998. Albanian etymological dictionary. Leiden: Brill.

Orotariko Euskal Hiztegia [OEH]. See Michelena, Luis & Ibon Sarasola.

Oswalt, Robert L. 1970. The detection of remote linguistic relationships. Computer Studies in the Humanities and Verbal Behavior 3: 117–129.

Ratcliffe, Robert. 2015. On calculating the reliability of the comparative method at long and medium distances: Afroasiatic comparative lexica as a test case. Journal of Historical Linguistics. 2(2): 239–281.

Ringe, Don, Tandy Warnow & Ann Taylor. 2002. Indo-European and computational cladistics. Transactions of the Philological Society 100(1):59–129.

Ristad, Eric & Peter Yianilos. 1998. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(5):522–532.

Rix, Helmut, Martin Kümmel, Thomas Zehnder, Reiner Lipp & Brigitte Schirmer. 2001. Lexikon der Indogermanischen Verben. Wiesbaden: Dr. Ludwig Reichert Verlag.

Rix, Helmut, Martin Kümmel, Thomas Zehnder, Reiner Lipp & Brigitte Schirmer. 2014. Lexikon der Indogermanischen Verben: Die Wurzeln und ihre Primärstammbildungen. Unter der Leitung von Helmut Rix und der Mitarbeit vieler anderer bearbeitet von Martin Kümmel, Thomas Zehnder, Reiner Lipp, Brigitte Schirmer. Third edition, electronic file, March 2014.

Roark, Brian, Michael Riley, Cyril Allauzen, Terry Tai & Richard Sproat. 2012. The OpenGrm open-source finite-state grammar software libraries. ACL 2012, Jeju Island, Korea, July.

Roark, Brian & Richard Sproat. 2007. Computational approaches to morphology and syntax. Oxford: Oxford University Press.

Schrijver, Peter. 2002. Irish ainder, Welsh anner, Breton annoar, Basque andere. In D. Restle & Dietmar Zaefferer (eds.) Sounds and systems: Studies in structures and change: A festschrift for Theo Vennemann, 205–219. Berlin & New York: Mouton de Gruyter.

Slaska, Natalia. 2006. Meaning lists in lexicostatistical studies: Evaluation, application, ramifications. PhD dissertation, University of Sheffield.

St. Arnaud, Adam, David Beck & Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Copenhagen. 2519–2528.

Swadesh, Morris. 1952. Lexicostatistic dating of prehistoric ethnic contacts. Proceedings of the American Philosophical Society 96: 452–463.

Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal of American Linguistics 21: 121–137.

Swadesh, Morris. 1971. The origin and diversification of language. Joel F. Sherzer (ed.). Chicago: Aldine-Atherton.

Tadmor, Uri, Martin Haspelmath & Bradley Taylor. 2010. Borrowability and the notion of basic vocabulary. Diachronica 27: 226–264.

Tahmasebi, Nina, Lars Borin & Adam Jatowt. 2018. Survey of computational approaches to diachronic conceptual change detection. [URL]

Tai, Terry, Wojciech Skut & Richard Sproat. 2011. Thrax: An open source grammar compiler built on OpenFst. ASRU 2011, Waikoloa Resort, Hawaii, December.

Teeter, Karl V. 1964. Algonquian languages and genetic relationship. In Horace G. Lunt (ed.) Proceedings of the Ninth International Congress of Linguists, 1026–1034. The Hague: Mouton.

Trask, Robert L. 1997. The history of Basque. London: Routledge.

Trask, Robert L. 2003. Where do mama/papa words come from? Ms. [URL]

Trask, Robert L. 2008. Etymological dictionary of Basque. Posthumous edition. Unpublished. Edited for the web by M. W. Wheeler. University of Sussex.

Turchin, Peter, Ilia Peiros & Murray Gell-Mann. 2010. Analyzing genetic connections between languages by matching consonant classes. Journal of Language Relationship 3: 117–126.

Uhlenbeck, C. C. 1909–1910. Contribution à une phonétique comparative des dialectes basques. Revista Internacional de los Estudios Vascos 3: 465–503; 4: 65–188.

Vendryes, Joseph. 1959. Lexique étymologique de l’irlandais ancien: Lettre A. Paris: CNRS.

Wodtko, Dagmar S., Britta Irslinger & Carolin Schneider. 2008. Nomina im Indogermanischen Lexikon. Heidelberg: Universitätsverlag Winter.

Cited by (5)

Manterola, Julen. 2025. Etymologies in a language isolate. In Investigating Language Isolates [Typological Studies in Language, 135], pp. 104 ff.

Blum, Frederic, Carlos Barrientos, Adriano Ingunza & Johann-Mattis List. 2024. Cognate reflex prediction as hypothesis test for a genealogical relation between the Panoan and Takanan language families. Scientific Reports 14:1.

List, Johann-Mattis. 2023. Evolutionary Aspects of Language Change. In Evolutionary Thinking Across Disciplines [Synthese Library, 478], pp. 103 ff.

List, Johann-Mattis. 2023. Open Problems in Computational Historical Linguistics. Open Research Europe 3, pp. 201 ff.

List, Johann-Mattis. 2024. Open Problems in Computational Historical Linguistics. Open Research Europe 3, pp. 201 ff.

This list is based on CrossRef data as of 8 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.