Article published in: International Journal of Corpus Linguistics
Vol. 28:3 (2023), pp. 318–343
Assessing word commonness
Adding dispersion to frequency
Published online: 25 November 2022
https://doi.org/10.1075/ijcl.21037.eke
Abstract
The article investigates the two main corpus indicators of word commonness, frequency and dispersion, through a
cross-validation analysis of frequency and four dispersion measures (‘Range’, ‘Chi-squared’, ‘Deviation of Proportions’ and
‘Juilland’s D’). The approach estimates the capacity of these measures to predict the distribution of corpus
items in an extracted language sample. Based on a dataset of 273 Norwegian compounds, the results show that Deviation
of Proportions in particular is a robust measure of dispersion that can be used in conjunction with frequency to substantiate
assertions of word commonness based on corpus data. In addition, dispersion measures reflect not only what sort of distribution
the frequency statistic is generated from, but also how reliably the frequency estimate from the corpus sample represents
frequency in the language variety that the corpus is sampled from.
Keywords: dispersion, frequency, word commonness, cross-validation, lexicography
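To make the contrast between frequency and dispersion in the abstract concrete, the following minimal sketch computes two of the named measures, relative range and Deviation of Proportions (DP), for a word's counts across corpus parts. It assumes the formulations common in the dispersion literature (relative range as the share of parts containing the word; DP as half the summed absolute difference between observed and expected proportions); the article's own formulas are not reproduced here and may differ in detail. The function names and the toy corpus are hypothetical.

```python
from typing import Sequence


def relative_range(counts: Sequence[int]) -> float:
    """Share of corpus parts in which the word occurs at least once."""
    return sum(1 for c in counts if c > 0) / len(counts)


def deviation_of_proportions(counts: Sequence[int], part_sizes: Sequence[int]) -> float:
    """Half the summed absolute difference between the observed share of the
    word's occurrences in each part and that part's share of the corpus.
    0 means perfectly even dispersion; values near 1 mean strong clumping."""
    total_count = sum(counts)
    total_size = sum(part_sizes)
    if total_count == 0:
        raise ValueError("word does not occur in the corpus")
    return 0.5 * sum(
        abs(c / total_count - s / total_size)
        for c, s in zip(counts, part_sizes)
    )


# Hypothetical corpus of four equally sized parts (1,000 tokens each).
part_sizes = [1000, 1000, 1000, 1000]
even = [5, 5, 5, 5]      # 20 occurrences, spread evenly
clumped = [20, 0, 0, 0]  # 20 occurrences, all in one part

print(relative_range(even), deviation_of_proportions(even, part_sizes))        # 1.0 0.0
print(relative_range(clumped), deviation_of_proportions(clumped, part_sizes))  # 0.25 0.75
```

Both toy words have the same frequency (20), yet their dispersion values differ sharply, which is the kind of information frequency alone cannot provide.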
Article outline
- 1. Introduction
- 2. Dictionaries and distributions
- 2.1 Dictionaries and core vocabulary
- 2.2 The case of compounds in Norwegian
- 2.3 Disadvantages of frequency, advantages of dispersion
- 2.4 Comparing measures
- 2.4.1 Frequency
- 2.4.2 Relative range
- 2.4.3 Chi squared χ²
- 2.4.4 Deviation of proportions (DP)
- 2.4.5 Juilland’s D (D_uneq)
- 2.4.6 Which one(s) to choose?
- 2.5 What are distributions in corpora supposed to tell us?
- 3. Methodology
- 3.1 Cross-validation and correlation analysis
- 4. Results and analysis
- 4.1 Frequency
- 4.2 Range
- 4.3 Chi squared
- 4.4 Deviation of proportions
- 4.5 Juilland’s D
- 5. Summary and concluding discussion
- Notes
- References
