Article published in: International Journal of Corpus Linguistics
Vol. 20:1 (2015), pp. 30–53
Evaluating reliability in quantitative vocabulary studies
The influence of corpus design and composition
Published online: 30 March 2015
https://doi.org/10.1075/ijcl.20.1.02mil
Recent methodological advances have been used to create word lists based on large corpora. The present paper explores whether these corpora, and the associated lists, are unequivocally more representative. Corpus design considerations have usually focused on issues of external representativeness (representing the target discourse domain), while disregarding issues of internal representativeness (whether the corpus permits reliable descriptions of linguistic variation). This disregard may be especially problematic for studies of lexical variation, where it is difficult to achieve stable, reliable results from corpus analysis. The present paper illustrates these challenges through experiments based on analysis of a corpus representing a highly restricted discourse domain: university-level introductory psychology textbooks. The results indicate that corpus design and composition have a much greater influence on lexical variation than previously recognized, highlighting the need to evaluate internal representativeness in quantitative corpus-based research.
