In: Mathematical Modelling in Linguistics and Text Analysis: Theory and applications
Edited by Adam Pawłowski, Sheila Embleton, Jan Mačutek and Aris Xanthos
[Current Issues in Linguistic Theory 370] 2025
pp. 17–26
Not just frequency
Keyness should integrate frequency, association, and dispersion
Published online: 13 October 2025
https://doi.org/10.1075/cilt.370.02gri
Abstract
For decades, nearly all approaches to keyness analysis in corpus linguistics have been based on computing
a single statistic for each word type in question (usually the log-likelihood score G²) and
ranking word types by how key that statistic makes them for a target corpus T. In this paper, I discuss
a new approach to keyness that (i) uses three dimensions of information (frequency in T, association to
T, and dispersion in T relative to a reference corpus R) and that (ii) measures both
association and dispersion using the information-theoretic measure of the Kullback-Leibler divergence. I outline the
computational steps and provide R code in a markdown document as well as a ready-made R function,
Keyness3D, with which readers can conduct analyses of their own data. I exemplify the use of the
function and its results by comparing the 'learned' text category of the Brown corpus against the rest of the corpus.
Keywords: keyness, log-likelihood G², frequency, association, dispersion, KL-divergence
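To make the role of the Kullback-Leibler divergence in the abstract more concrete, here is a minimal sketch in Python. It is not the chapter's Keyness3D function. All counts are invented toy values, the variable names are hypothetical, and the two comparisons are one common way of applying the divergence, assumed here for illustration. For association, it compares where a word occurs (T vs. R) with the relative sizes of T and R. For dispersion, it compares the word's spread over the parts of T with the relative sizes of those parts. How the chapter relates dispersion in T to R is not covered here.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for two discrete distributions.

    Terms with p_i == 0 contribute 0 by convention; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy counts (NOT taken from the Brown corpus).
word_in_T, word_in_R = 30, 10        # occurrences of one word type in T and in R
size_T, size_R = 100_000, 900_000    # corpus sizes in tokens

# Association sketch (assumption): distribution of the word over T and R
# compared with the distribution of all tokens over T and R.
word_total = word_in_T + word_in_R
posterior = [word_in_T / word_total, word_in_R / word_total]
prior = [size_T / (size_T + size_R), size_R / (size_T + size_R)]
association = kl_divergence(posterior, prior)

# Dispersion sketch (assumption): distribution of the word over the parts of T
# compared with the relative sizes of those parts.
word_per_part = [15, 15, 0, 0]
part_sizes = [25_000, 25_000, 25_000, 25_000]
observed = [c / sum(word_per_part) for c in word_per_part]
expected = [s / sum(part_sizes) for s in part_sizes]
dispersion = kl_divergence(observed, expected)

print(f"association (bits): {association:.3f}")
print(f"dispersion (bits):  {dispersion:.3f}")
```

In this sketch, a value of 0 means the word is distributed exactly as the corpus sizes (or part sizes) would predict, and larger values mean a stronger departure from that baseline. Frequency in T, the third dimension, is simply the count `word_in_T` (or a transformation of it).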
Article outline
- 1. Introduction
- 1.1 General introduction
- 1.2 Overview of the present paper
- 2. Methods
- 2.1 Data
- 2.2 The three components of keyness
- 2.2.1 The frequency component
- 2.2.2 The association component
- 2.2.3 The dispersion component
- 2.3 What to do with those values?
- 2.3.1 Keeping dimensions separate
- 2.3.2 Amalgamations
- 3. Case study: ‘Learned’ in Brown
- 4. Concluding remarks
