Article published in: Journal of Historical Linguistics
Vol. 10:1 (2020), pp. 42–86
A probabilistic assessment of the Indo-Aryan Inner–Outer Hypothesis
Published online: 25 May 2020
https://doi.org/10.1075/jhl.18038.cat
Abstract
This paper uses a novel data-driven probabilistic approach to address the century-old Inner-Outer hypothesis of
Indo-Aryan. I develop a Bayesian hierarchical mixed-membership model to assess the validity of this hypothesis using a large data
set of automatically extracted sound changes operating between Old Indo-Aryan and Modern Indo-Aryan speech varieties. I employ
different prior distributions in order to model sound change, one of which, the Logistic Normal distribution, has not received
much attention in linguistics outside of Natural Language Processing, despite its many attractive features. I find evidence for
cohesive dialect groups that have made their imprint on contemporary Indo-Aryan languages, and find that when a Logistic Normal
prior is used, the distribution of dialect components across languages is largely compatible with a core-periphery pattern similar
to that proposed under the Inner-Outer hypothesis.
Article outline
- 1. Introduction
- 2. Background
  - 2.1 Indo-Aryan dialectal variation
    - 2.1.1 Pre-Old Indo Aryan period
    - 2.1.2 Old Indo Aryan period
    - 2.1.3 Middle Indo Aryan period
    - 2.1.4 New Indo Aryan period
  - 2.2 Proposed Indo-Aryan dialectal groupings
- 3. Rationale
  - 3.1 Bayesian models in linguistics and related fields
  - 3.2 Operationalizing the Inner-Outer Hypothesis
- 4. Data
- 5. Modeling sound change
  - 5.1 Prior distributions over sound change probabilities
- 6. Generative model
- 7. Implementation and inference
- 8. Results
  - 8.1 Sparsity of language-group distributions
  - 8.2 Language-group distributions
  - 8.3 Sound change distributions
  - 8.4 Posterior predictive checks
    - 8.4.1 Entropy
    - 8.4.2 Accuracy
- 9. Discussion and outlook
- 10. Conclusion
- Acknowledgements
- Notes
- Appendix (supplementary material)
  - Dirichlet model sound change probabilities
  - Logistic normal model sound change probabilities
  - Accuracy scores for sound change distributions for simulated data
