A generating model for Finnish nominal inflection using distributional semantics

Nikolaev, Alexandre; Chuang, Yu-Ying; Baayen, R. Harald

doi:10.1075/ml.22008.nik

Article published in: Explorations of morphological structure in distributional space
Edited by Melanie J. Bell, Juhani Järvikivi and Vito Pirrelli
[The Mental Lexicon 17:3] 2022, pp. 368–394

Alexandre Nikolaev | University of Eastern Finland

Yu-Ying Chuang | University of Tübingen

R. Harald Baayen | University of Tübingen

Available under the Creative Commons Attribution (CC BY) 4.0 license.

For any use beyond this license, please contact the publisher at rights@benjamins.nl.

Published online: 17 March 2023

Abstract

Finnish nouns are characterized by rich inflectional variation, with obligatory marking of case and number, with optional possessive suffixes and with the possibility of further cliticization. We present a model for the conceptualization of Finnish inflected nouns, using pre-compiled fasttext embeddings (300-dimensional semantic vectors that approximate words’ meanings). Instead of deriving the semantic vector of an inflected word from another word in its paradigm, we propose that an inflected word is conceptualized by means of summation of latent vectors representing the meanings of its lexeme and its inflectional features. We tested this model on the 2,000 most frequent Finnish nouns and their inflected word forms from a corpus of Finnish (84 million tokens). Visualization of the semantic space of Finnish using t-SNE clarified that a ‘main effects’ additive model does not do justice to the semantics of inflection. In Finnish, how number is realized turns out to vary substantially with case. Further interactions emerged with the possessive suffixes and the clitics. By taking these interactions into account, the accuracy of our model, evaluated with the fasttext embeddings as gold standard, improved from 76% to 89%. Analyses of the errors made by the model clarified that 7.5% of errors are due to overabundance (and hence not true errors), and that 16.5% of the errors involved exchanges of semantically highly similar stems (lexemes). Our results indicate, first, that the semantics of Finnish noun inflection are more intricate than assumed thus far, and second, that these intricacies can be captured with surprisingly high accuracy by a simple generating model based on imputed semantic vectors for lexemes, inflectional features, and interactions of inflectional features.
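The additive idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexemes, the random stand-ins for fasttext vectors, the least-squares imputation of latent vectors, and the nearest-neighbour accuracy measure are all assumptions made for illustration. The sketch fits a main-effects model (lexeme + case + number) and a model with an added case-by-number interaction, then compares how well each reconstructs the "gold" vectors.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
DIM = 300  # dimensionality of the fasttext vectors mentioned in the abstract

# Toy inventory (illustrative only, not the paper's data set).
LEXEMES = ["talo", "kissa", "kirja", "auto"]
CASES = ["nom", "gen", "ine"]
NUMBERS = ["sg", "pl"]
FORMS = list(itertools.product(LEXEMES, CASES, NUMBERS))


def random_vectors(keys):
    """Random stand-ins for embeddings; real work would load fasttext vectors."""
    return {k: rng.normal(size=DIM) for k in keys}


lex, case, num = random_vectors(LEXEMES), random_vectors(CASES), random_vectors(NUMBERS)
case_num = random_vectors(itertools.product(CASES, NUMBERS))

# Hypothetical "gold" vectors, built WITH a case-by-number interaction plus noise,
# so that a purely additive main-effects model is expected to underfit.
gold = np.array([
    lex[l] + case[c] + num[n] + case_num[(c, n)] + 0.1 * rng.normal(size=DIM)
    for l, c, n in FORMS
])


def design_matrix(with_interaction):
    """One indicator column per lexeme, case, number (and case:number pair)."""
    names = LEXEMES + CASES + NUMBERS
    if with_interaction:
        names = names + [f"{c}:{n}" for c, n in itertools.product(CASES, NUMBERS)]
    col = {name: j for j, name in enumerate(names)}
    X = np.zeros((len(FORMS), len(names)))
    for i, (l, c, n) in enumerate(FORMS):
        active = [l, c, n] + ([f"{c}:{n}"] if with_interaction else [])
        X[i, [col[a] for a in active]] = 1.0
    return X


def fit_and_score(with_interaction):
    X = design_matrix(with_interaction)
    # Impute one latent vector per column by least squares (an assumed method).
    latent, *_ = np.linalg.lstsq(X, gold, rcond=None)
    pred = X @ latent  # each form's vector is the sum of its latent vectors
    mse = float(np.mean((pred - gold) ** 2))
    unit = lambda m: m / np.linalg.norm(m, axis=1, keepdims=True)
    # A prediction counts as correct if its most similar gold vector is its own target.
    nearest = (unit(pred) @ unit(gold).T).argmax(axis=1)
    acc = float(np.mean(nearest == np.arange(len(FORMS))))
    return acc, mse


acc_main, mse_main = fit_and_score(with_interaction=False)
acc_inter, mse_inter = fit_and_score(with_interaction=True)
print(f"main effects: accuracy={acc_main:.2f}, mse={mse_main:.4f}")
print(f"interaction:  accuracy={acc_inter:.2f}, mse={mse_inter:.4f}")
```

Because the interaction model contains the main-effects model as a special case, its reconstruction error can only be equal or lower. The 76% and 89% accuracies reported in the abstract come from the authors' own data and procedure and are not reproduced by this toy example.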

Keywords: word embeddings, inflectional morphology, fasttext, word2vec, tSNE, imputed semantic vectors, Finnish language

Article outline

1. Introduction
2. Finnish noun inflection
3. Fasttext-based models of Finnish noun semantics
4. A generating model for nominal conceptualization
5. Error analysis
6. Models based on word2vec instead of on fasttext
7. Discussion
Notes
References

Cited by (10)

Khan, Jebran, Kashif Ahmad, Senthil Kumar Jagatheesaperumal & Kyung-Ah Sohn

2025. Textual variations in social media text processing applications: challenges, solutions, and trends. Artificial Intelligence Review 58:3

Baayen, R. Harald

2024. The wompom. Corpus Linguistics and Linguistic Theory 20:3 ► pp. 615 ff.

Hakala, Tero, Tiina Lindh-Knuutila, Annika Hultén, Minna Lehtonen & Riitta Salmelin

2024. Subword Representations Successfully Decode Brain Responses to Morphologically Complex Written Words. Neurobiology of Language 5:4 ► pp. 844 ff.

Heikkilä, Timo T., Nea Soralinna & Jukka Hyönä

2024. Relating foveal and parafoveal processing efficiency with word-level parameters in text reading. Journal of Memory and Language 137 ► pp. 104516 ff.

Herce, Borja & Marc Allassonnière-Tang

2024. The meaning of morphomes: distributional semantics of Spanish stem alternations. Linguistics Vanguard 10:1 ► pp. 115 ff.

Mujezinović, Erdin, Vsevolod Kapatsinski & Ruben van de Vijver

2024. One Cue's Loss Is Another Cue's Gain—Learning Morphophonology Through Unlearning. Cognitive Science 48:5

Nieder, Jessica, Ruben van de Vijver & Adam Ussishkin

2024. Emerging Roots: Investigating Early Access to Meaning in Maltese Auditory Word Recognition. Cognitive Science 48:11

Shafaei-Bajestan, Elnaz, Masoumeh Moradipour-Tari, Peter Uhrig & R. Harald Baayen

2024. The pluralization palette: unveiling semantic clusters in English nominal pluralization through distributional semantics. Morphology 34:4 ► pp. 369 ff.

van de Vijver, Ruben, Emmanuel Uwambayinema & Yu-Ying Chuang

2024. Comprehension and production of Kinyarwanda verbs in the Discriminative Lexicon. Linguistics 62:1 ► pp. 79 ff.

Kivisaari, Sasa L., Annika Hultén, Marijn van Vliet, Tiina Lindh-Knuutila & Riitta Salmelin

2023. Semantic feature norms: a cross-method and cross-language comparison. Behavior Research Methods 56:6 ► pp. 5788 ff.

This list is based on CrossRef data as of 27 November 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.