Article published in: International Journal of Corpus Linguistics
Vol. 27:3 (2022), pp. 291–320
Handle it in-house?
Learner corpora frequency lists and lexical sophistication
Published online: 20 May 2022
https://doi.org/10.1075/ijcl.20024.nai
Abstract
Vocabulary lists of high-frequency lexical items are an important resource in language education and a key product of corpus research. However, no single vocabulary list will be useful for every learning context, with the appropriateness of such lists affected by the corpora on which they are based. This paper investigates the impact of corpus selection on one measure of lexical sophistication, Advanced Guiraud, focusing on two frequency lists originating from an in-house learner corpus (PELIC) and a global learner corpus (Cambridge Learner Corpus). This analysis shows that frequency lists derived from both types of learner corpus can effectively serve as the basis for measuring the development of lexical sophistication, regardless of the specific program of the learners. Therefore, publicly available learner corpus frequency lists can be a reliable resource for stakeholders interested in the lexical gains of language learners.
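To make the measure concrete: Advanced Guiraud is commonly defined as the number of advanced word types in a text divided by the square root of the total number of tokens, where "advanced" means not on a chosen list of basic, high-frequency words. The sketch below is illustrative only. The tokenizer, the toy frequency list, and the function name `advanced_guiraud` are assumptions made for this example and do not reproduce the study's actual lists (derived from PELIC and the Cambridge Learner Corpus) or its preprocessing.

```python
import math
import re


def advanced_guiraud(text, basic_words):
    """Return advanced types / sqrt(total tokens) for a text.

    `basic_words` is the set of high-frequency items treated as
    non-advanced; swapping in a different frequency list changes the score.
    """
    # Simplistic tokenizer (an assumption): lowercase alphabetic strings.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    # Advanced types are the distinct tokens absent from the basic list.
    advanced_types = {t for t in tokens if t not in basic_words}
    return len(advanced_types) / math.sqrt(len(tokens))


# Hypothetical toy frequency list, not taken from either corpus.
basic_words = {"the", "a", "student", "is", "in", "class", "and", "reads"}
essay = (
    "The student reads a dissertation in the class "
    "and the student annotates a corpus"
)
score = advanced_guiraud(essay, basic_words)
print(round(score, 3))
```

Because the basic-word list is a parameter, the same text can be scored against an in-house list and a global list, which is the kind of comparison the abstract describes.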
Article outline
- 1. Introduction
- 2. Learner corpora (LC) and lexical sophistication
- 2.1 In-house corpora: The University of Pittsburgh English Language Institute Corpus (PELIC)
- 2.2 Global corpora: ETS Corpus of Non-Native Written English (ETS)
- 2.3 Lexical sophistication
- 2.4 Motivation for the current study
- 3. Methodology
- 3.1 Frequency lists
- 3.2 ETS comparison
- 3.3 Data collection and description
- 3.4 Comparison with lexical diversity
- 4. Results
- 4.1 Lexical sophistication descriptive statistics
- 4.2 Lexical sophistication inferential statistics
- 4.3 AG comparison to vocD
- 4.4 Results summary
- 5. Discussion
- 6. Conclusion
- Notes