Not just frequency: Keyness should integrate frequency, association, and dispersion

Gries, Stefan Th.

doi:10.1075/cilt.370.02gri

In: Mathematical Modelling in Linguistics and Text Analysis: Theory and applications
Edited by Adam Pawłowski, Sheila Embleton, Jan Mačutek and Aris Xanthos
[Current Issues in Linguistic Theory 370] 2025
pp. 17–26

Stefan Th. Gries | UC Santa Barbara | JLU Giessen

Published online: 13 October 2025

https://doi.org/10.1075/cilt.370.02gri

Abstract

For decades, nearly all approaches to keyness analysis in corpus linguistics have computed a single statistic for each word type in question (usually the log-likelihood score G²) and ranked word types by how key that statistic made them for a target corpus T. In this paper, I discuss a new approach to keyness that (i) uses three dimensions of information (frequency in T, association to T, and dispersion in T relative to a reference corpus R) and that (ii) measures both association and dispersion using the Kullback-Leibler divergence, an information-theoretic measure. I outline the computational steps and provide R code in a markdown document as well as a ready-made R function, Keyness3D, with which readers can analyze their own data. I exemplify the use of the function and its results by comparing the learned text category in the Brown corpus against the rest of the corpus.

Keywords: keyness, log-likelihood G², frequency, association, dispersion, KL-divergence
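
To make the three dimensions named in the abstract more concrete, the following is a minimal Python sketch, not the author's R code and not the Keyness3D function. It assumes one plausible reading of the abstract: association is the Kullback-Leibler divergence between a word's distribution over T and R and the corpus-size baseline, and dispersion is the divergence between the word's distribution over the parts of T and the relative sizes of those parts. All counts, variable names, and the choice of base-2 logarithms are hypothetical, and the comparison of dispersion in T against R is not shown.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(counts):
    """Turn raw counts into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical toy counts for one word type (not taken from the paper).
word_in_T, word_in_R = 30, 10        # occurrences in target / reference corpus
size_T, size_R = 1000, 9000          # corpus sizes in tokens

# Frequency dimension: relative frequency in T (the scaling is an assumption).
frequency = word_in_T / size_T

# Association dimension (assumed formulation): how far the word's
# distribution over (T, R) departs from the corpus-size baseline.
posterior = normalize([word_in_T, word_in_R])
prior = normalize([size_T, size_R])
association = kl_divergence(posterior, prior)

# Dispersion dimension (assumed formulation): how far the word's distribution
# over the parts of T departs from the parts' relative sizes.
word_per_part = [10, 10, 10, 0]      # occurrences in four parts of T
part_sizes = [250, 250, 250, 250]    # tokens per part
dispersion = kl_divergence(normalize(word_per_part), normalize(part_sizes))

print(f"frequency={frequency:.3f} "
      f"association={association:.3f} dispersion={dispersion:.3f}")
```

Under these assumptions, a divergence of 0 means the word is distributed exactly as the baseline predicts, and larger values mean stronger attraction to one corpus (association) or a clumpier distribution across parts (dispersion). Keeping the three values as a tuple, rather than collapsing them into one score, corresponds to the "keeping dimensions separate" option listed in the article outline.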

Article outline

1. Introduction
- 1.1 General introduction
- 1.2 Overview of the present paper
2. Methods
- 2.1 Data
- 2.2 The three components of keyness
  - 2.2.1 The frequency component
  - 2.2.2 The association component
  - 2.2.3 The dispersion component
- 2.3 What to do with those values?
  - 2.3.1 Keeping dimensions separate
  - 2.3.2 Amalgamations
3. Case study: ‘Learned’ in Brown
4. Concluding remarks
References
