Article published in: Reproducibility, Replicability, and Robustness in Corpus Linguistics
Edited by Martin Schweinberger and Michael Haugh
[International Journal of Corpus Linguistics 30:2] 2025
pp. 150–170
Achieving stability in corpus-based analysis of word types
Published online: 20 May 2025
https://doi.org/10.1075/ijcl.24109.egb
Abstract
Rank-ordered lists of word types are ubiquitous in corpus linguistics and applied linguistics. Word lists are
commonly developed as aids for language teaching and learning, vocabulary testing, and language description. Yet, these lists are
often produced and used without evaluation of their stability — or replicability — across corpus samples. Our primary
objective in this paper is to describe the cumulative state of knowledge regarding the stability of corpus-based word type lists,
focusing on three goals that motivate the creation and use of rank-ordered lists: identifying key lexical items for learning or
teaching, assessing vocabulary size or knowledge, and identifying all items in a language domain. We show that word type lists are
far less stable than researchers and practitioners often assume, although there is substantial variability in stability depending
on the goals and methods behind list creation.
Keywords: word types, ranking, vocabulary lists, word frequency, stability
Article outline
- 1. Introduction
- 1.1 Corpus representativeness and generalizability
- 1.2 Stability and rank-ordered type lists
- 1.3 Common goals for type analysis in corpus linguistic research
- 1.4 Aims of the paper
- 2. Goal 1: Identify a list of lexical items that are important to learn or teach
- 3. Goal 2: Assess vocabulary size or knowledge
- 4. Goal 3: Identify all of the lexical items in a language domain
- 5. Discussion and conclusion
- Notes
- References
