Finding structure in linguistic data: Cluster analysis

Divjak, Dagmar; Fieller, Nick

doi:10.1075/hcp.43.16div

In: Corpus Methods for Semantics: Quantitative studies in polysemy and synonymy
Edited by Dylan Glynn and Justyna A. Robinson
[Human Cognitive Processing 43] 2014
► pp. 405–441

Dagmar Divjak | University of Sheffield

Nick Fieller | University of Sheffield

Published online: 6 November 2014

Cluster analysis is an exploratory data analysis technique that encompasses a number of different algorithms and methods for sorting objects into groups. It requires the analyst to make choices about dissimilarity measures, grouping algorithms and so on, and these choices are difficult to make without understanding their theoretical implications and knowing the data very well. This chapter introduces the distance measures and clustering algorithms most commonly used in cluster-analytic work. Unlike Baayen (2008), Johnson (2008) and Gries (2009), it aims primarily to equip the researcher with at least a basic understanding of what happens behind the scenes when a dataset is explored with a particular cluster-analytic technique.

Keywords: clustering algorithms, distance measures

References (19)

Alviar, J.J. (2008). Recent advances in computational linguistics and their application to biblical studies. New Testament Studies, 54(1), 139–159.

Baayen, R.H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.

Backhaus, K., Erichson, B., Plinke, W., & Weiber, R. (1996). Multivariate Analysemethoden: Eine anwendungsorientierte Einführung. 8th edition. Berlin; Heidelberg; New York: Springer.

Brock, G., Pihur, V., Datta, S., & Datta, S. (2011). clValid: Validation of clustering results. Journal of Statistical Software, 25(4), March 2008. R package version 0.6-2. [URL].

Divjak, D., & Gries, St. Th. (2006). Ways of trying in Russian: Clustering behavioral profiles. Journal of Corpus Linguistics and Linguistic Theory, 2(1), 23–60.

Everitt, B.S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis. 5th edition. Oxford: Wiley.

Gower, J., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5–48.

Gries, St. Th. (2009). Statistics for linguistics with R: A practical introduction. Berlin: Mouton de Gruyter.

Harnad, S. (2005). To cognize is to categorize: Cognition is categorization. In C. Lefebvre & H. Cohen (Eds.), Handbook on categorization (pp. 19–43). Oxford & London: Elsevier.

Hennig, C. (2010). fpc: Flexible procedures for clustering. R package version 2.0-3. [URL].

Johnson, K. (2008). Quantitative methods in linguistics. New York: Wiley-Blackwell.

Kaufman, L., & Rousseeuw, P.J. (1990). Finding groups in data: An introduction to cluster analysis (Series in Applied Probability and Statistics). New York: Wiley-Blackwell.

Milligan, G.W., & Cooper, M.C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. [URL].

Rousseeuw, P.J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65.

Shaw, D. (1974). Statistical analysis of dialectal boundaries. Computers and the Humanities, 8, 173–177.

Suzuki, R., & Shimodaira, H. An R package for hierarchical clustering with p-values. Retrieved from [URL] [Accessed 25 May 2012].

Tryon, R.C. (1939). Cluster analysis. New York: McGraw-Hill.

Wichern, D.W., & Johnson, R.A. (2007). Applied multivariate statistical analysis. Englewood Cliffs: Prentice-Hall.

Cited by (28)

Astobiza, Aníbal M.

2025. The Geometry of Language: Understanding LLMs in Bioethics. Journal of Bioethical Inquiry 22:3 ► pp. 573 ff.

Dai, Ying & Yicheng Wu

2024. The colexification of vision and cognition in Mandarin: controlled activity surpasses uncontrolled experience. Cognitive Linguistics 35:3 ► pp. 345 ff.

Zhang, Yixuan, Yimeng Wang, Nutchanon Yongsatianchot, Joseph D Gaggiano, Nurul M Suhaimi, Anne Okrah, Miso Kim, Jacqueline Griffin & Andrea G Parker

2024. Proceedings of the CHI Conference on Human Factors in Computing Systems, ► pp. 1 ff.

Liu, Meili

2023. Towards a dynamic behavioral profile of the Mandarin Chinese temperature term re: a diachronic semasiological approach. Corpus Linguistics and Linguistic Theory 19:2 ► pp. 289 ff.

Milin, Petar, Benjamin V. Tucker & Dagmar Divjak

2023. A learning perspective on the emergence of abstractions: the curious case of phone(me)s. Language and Cognition 15:4 ► pp. 740 ff.

Robledo, Hernán & Rogelio Nazar

2023. A proposal for the inductive categorisation of parenthetical discourse markers in Spanish using parallel corpora. International Journal of Corpus Linguistics 28:4 ► pp. 500 ff.

SUGAWARA, Yuki & Kazuho KAMBARA

2023. The Many Uses of Explain:. Annals of the Japan Association for Philosophy of Science 32:0 ► pp. 23 ff.

Van den Heede, Margot & Peter Lauwers

2023. Syntactic productivity under the microscope: the lexical and semantic openness of Dutch minimizing constructions. Folia Linguistica 57:3 ► pp. 723 ff.

Wu, Shuqiong & Yue Ou

2023. A quantitative study of the polysemy of Mandarin Chinese perception verb kàn ‘look/see’. Australian Journal of Linguistics 43:3 ► pp. 191 ff.

Zhou, Jiangping

2023. A corpus-based study of explicit objective modal expressions in English. Studia Neophilologica 95:1 ► pp. 100 ff.

王, 婷

2023. The Polysemy of the Chinese Verb “Kai”: A Corpus-Based Behavioral Profile Analysis. Modern Linguistics 11:06 ► pp. 2771 ff.

Krawczak, Karolina

2022. Modeling constructional variation. In Analogy and Contrast in Language [Human Cognitive Processing, 73], ► pp. 341 ff.

Siahaan, Poppy

2022. Indonesian basic olfactory terms: more negative types but more positive tokens. Cognitive Linguistics 33:3 ► pp. 447 ff.

Wang, Jiaojiao & Jiangping Zhou

2022. A Corpus-Based Study of Semantic Categorizations of Attracted Adjectives to the it BE ADJ clause Construction. Sage Open 12:2

Torres, Peter Joseph

2021. The role of modals in policies: The US opioid crisis as a case study. Applied Corpus Linguistics 1:3 ► pp. 100008 ff.

Johansson, Marjut & Veronika Laippala

2020. Affectivity in the #jesuisCharlie Twitter discussion. Pragmatics. Quarterly Publication of the International Pragmatics Association (IPrA) 30:2 ► pp. 179 ff.

Dattner, Elitzur

2019. The Hebrew dative: Usage patterns as discourse profile constructions. Linguistics 57:5 ► pp. 1073 ff.

Proos, Mariann

2019. Polysemy of the Estonian perception verb nägema ‘to see’. In Perception Metaphors [Converging Evidence in Language and Communication Research, 19], ► pp. 231 ff.

Vandevoorde, Lore

2019. Register, Source Language, and Cognateness Effects on Lexical Choice in Translated Dutch. Meta 63:3 ► pp. 627 ff.

Brown, David West

2018. English and Empire,

Kifokeris, Dimosthenis & Yiannis Xenidis

2018. Application of Linguistic Clustering to Define Sources of Risks in Technical Projects. ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part A: Civil Engineering 4:1

Ioannou, Georgios

2017. A corpus-based analysis of the verb pleróo in Ancient Greek. Review of Cognitive Linguistics 15:1 ► pp. 253 ff.

Ioannou, Georgios

2019. From Athenian fleet to prophetic eschatology. Correlating formal features to themes of discourse in Ancient Greek. Folia Linguistica 53:s40-s2 ► pp. 355 ff.

Vandevoorde, Lore, Els Lefever, Koen Plevoets & Gert De Sutter

2017. A corpus-based study of semantic differences in translation. Target. International Journal of Translation Studies 29:3 ► pp. 388 ff.

Desagulier, Guillaume

2015. Le statut de la fréquence dans les grammaires de constructions : simple comme bonjour ?. Langages N° 197:1 ► pp. 99 ff.

Desagulier, Guillaume

2017. Clustering Methods. In Corpus Linguistics and Statistics with R [Quantitative Methods in the Humanities and Social Sciences], ► pp. 239 ff.

Desagulier, Guillaume

2019. Can word vectors help corpus linguists?. Studia Neophilologica 91:2 ► pp. 219 ff.

[no author supplied]

2016. Review of Moisl (2015): Cluster Analysis for Corpus Linguistics. International Journal of Corpus Linguistics 21:4 ► pp. 581 ff.

This list is based on CrossRef data as of 10 December 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.